Re: Beam dropping events from Kafka after reshuffle ?

2024-09-23 Thread marc hurabielle
Hello,

I am jumping in on this, because we are doing the same thing as Lydian.
In our case, we are using the default timestamp strategy in Kafka (so
processing timestamp).
We were also adding a processing timestamp manually, the same way Lydian does.


However, we still see late data. It mainly happens in our integration tests with
Flink (parallelism 1), and happens only rarely in production.

So does that mean we can't control the timestamp of an element even with
`window.TimestampedValue(event, time.time())`?

Best,

Marc


On Tue, Sep 24, 2024, 04:23 Reuven Lax via user 
wrote:

> Also as I said, the duplicate files might not appear like duplicates,
> which can be quite confusing.
>
> Out of curiosity, I would try - just for testing_ remove the line where
> you "add" processing time, and also set allowed_lateness to the largest
> allowed value. This will help determine whether late data is causing the
> dropped records.
>
> Reuven
>
>
> On Mon, Sep 23, 2024 at 2:10 PM Lydian Lee 
> wrote:
>
>> Hi Jan,
>>
>> Thanks for taking a quick look!  Yes, the "with" statement would close
>> after the write.  In our use cases, we actually don't mind if there are
>> duplicates of data written, but we are more concerned about the missing
>> data, which is the issue we are facing right now.
>>
>> On Mon, Sep 23, 2024 at 11:54 AM Reuven Lax via user <
>> user@beam.apache.org> wrote:
>>
>>> Do you close the write afterwards? If not, I wonder if you could lose
>>> records due to unflushed data.
>>>
>>> Also - what you're doing here can definitely lead to duplicate data
>>> written, since this DoFn can be run multiple times. The duplicates might
>>> also appear different if the Iterables are slightly different on retries,
>>> especially in the case when Flink restarts a checkpoint.
>>>
>>> Reuven
>>>
>>> On Mon, Sep 23, 2024 at 1:40 PM Lydian Lee 
>>> wrote:
>>>
 Hi Jan,

 Thanks so much for your help. Here's our write to s3:

 from pyarrow.parquet import ParquetWriter

 class WriteBatchesToS3(beam.DoFn):
 def __init__(
 self,
 output_path: str,
 schema: pa.schema,
 pipeline_options: PipelineOptions,
 ) -> None:
 self.output_path = output_path
 self.schema = schema
 self.pipeline_options = pipeline_options

 def process(self, data: Iterable[List[Dict]]) -> None:
 """Write one batch per file to S3."""
 client = beam.io.aws.clients.s3.boto3_client.Client(options=self.
 pipeline_options)

 fields_without_metadata = [pa.field(f.name, f.type) for f in self.
 schema]
 schema_without_field_metadata = pa.schema(fields_without_metadata)

 filename = os.path.join(
 self.output_path,
 f"uuid_{str(uuid4())}.parquet",
 )

 tables = [pa.Table.from_pylist(batch, schema=
 schema_without_field_metadata) for batch in data]
 if len(tables) == 0:
 logging.info(f"No data to write for key: {partition_date}, the grouped
 contents are: {data}")
 return

 with beam.io.aws.s3io.S3IO(client=client).open(filename=filename, mode=
 "w") as s3_writer:
 with ParquetWriter(
 s3_writer, schema_without_field_metadata, compression="SNAPPY",
 use_deprecated_int96_timestamps=True
 ) as parquet_writer:
 merged_tables = pa.concat_tables(tables)
 parquet_writer.write_table(merged_tables)

 On Fri, Sep 20, 2024 at 12:02 AM Jan Lukavský  wrote:

> Hi Lydian,
>
> because you do not specify 'timestamp_policy' it should use the
> default, which should be processing time, so this should not be the issue.
> The one potentially left transform is the sink transform, as Reuven
> mentioned. Can you share details of the implementation?
>
>  Jan
> On 9/19/24 23:10, Lydian Lee wrote:
>
> Hi, Jan
>
> Here's how we do ReadFromKafka, the expansion service is just to
> ensure we can work with xlang in k8s, so please ignore them.
> from apache_beam.io.kafka import default_io_expansion_service
> ReadFromKafka(
> consumer_config={
> "group.id": "group-name",
> "auto.offset.reset": "latest",
> "enable.auto.commit": "false",
> },
> topics="topic-name",
> with_metadata=False,
> expansion_service=default_io_expansion_service(
> append_args=[
> f"--defaultEnvironmentType=PROCESS",
> f'--defaultEnvironmentConfig={"command":"/opt/apache/beam/java_boot"}'
> ,
> "--experiments=use_deprecated_read",
> ]
> ),
> commit_offset_in_finalize=True,
> )
>
> Do you know what would be the right approach for using processing time
> instead? I thought the WindowInto supposed to use the timestamp we appened
> to the event?  Do you think it is still using the original Kafka event
> timestamp?  Thanks!
>
>
>
> On Thu, Sep 19, 2024 at 7:53 AM Jan Lukavský  wrote:
>
>> Can you share the (relevant) parameters of the ReadFromKafka
>> transform?
>>
>> This feels st

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-23 Thread Reuven Lax via user
Also as I said, the duplicate files might not appear like duplicates, which
can be quite confusing.

Out of curiosity, I would try - just for testing - removing the line where you
"add" processing time, and also setting allowed_lateness to the largest allowed
value. This will help determine whether late data is causing the dropped
records.
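
A minimal sketch of that experiment, assuming the 1-minute fixed window from the
quoted pipeline (the beam.Create stand-in input and the one-week lateness cap are
illustrative assumptions, not values from this thread; the real pipeline would
keep its ReadFromKafka source):

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.utils import timestamp

# Stand-in elements; in the real pipeline these come from ReadFromKafka + Deserialize.
events = [{"id": 1}, {"id": 2}, {"id": 3}]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # For the test: no manual "add processing time" beam.Map step here.
        | "Window into Fixed Intervals" >> beam.WindowInto(
            window.FixedWindows(60),  # 1-minute windows, as in the quoted pipeline
            # Effectively "the largest allowed value" for the experiment:
            # a week of allowed lateness, so nothing can be dropped as late.
            allowed_lateness=timestamp.Duration(seconds=7 * 24 * 3600),
        )
        | beam.Map(print)
    )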

Reuven


On Mon, Sep 23, 2024 at 2:10 PM Lydian Lee  wrote:

> Hi Jan,
>
> Thanks for taking a quick look!  Yes, the "with" statement would close
> after the write.  In our use cases, we actually don't mind if there are
> duplicates of data written, but we are more concerned about the missing
> data, which is the issue we are facing right now.
>
> On Mon, Sep 23, 2024 at 11:54 AM Reuven Lax via user 
> wrote:
>
>> Do you close the write afterwards? If not, I wonder if you could lose
>> records due to unflushed data.
>>
>> Also - what you're doing here can definitely lead to duplicate data
>> written, since this DoFn can be run multiple times. The duplicates might
>> also appear different if the Iterables are slightly different on retries,
>> especially in the case when Flink restarts a checkpoint.
>>
>> Reuven
>>
>> On Mon, Sep 23, 2024 at 1:40 PM Lydian Lee 
>> wrote:
>>
>>> Hi Jan,
>>>
>>> Thanks so much for your help. Here's our write to s3:
>>>
>>> from pyarrow.parquet import ParquetWriter
>>>
>>> class WriteBatchesToS3(beam.DoFn):
>>> def __init__(
>>> self,
>>> output_path: str,
>>> schema: pa.schema,
>>> pipeline_options: PipelineOptions,
>>> ) -> None:
>>> self.output_path = output_path
>>> self.schema = schema
>>> self.pipeline_options = pipeline_options
>>>
>>> def process(self, data: Iterable[List[Dict]]) -> None:
>>> """Write one batch per file to S3."""
>>> client = beam.io.aws.clients.s3.boto3_client.Client(options=self.
>>> pipeline_options)
>>>
>>> fields_without_metadata = [pa.field(f.name, f.type) for f in self.schema
>>> ]
>>> schema_without_field_metadata = pa.schema(fields_without_metadata)
>>>
>>> filename = os.path.join(
>>> self.output_path,
>>> f"uuid_{str(uuid4())}.parquet",
>>> )
>>>
>>> tables = [pa.Table.from_pylist(batch, schema=
>>> schema_without_field_metadata) for batch in data]
>>> if len(tables) == 0:
>>> logging.info(f"No data to write for key: {partition_date}, the grouped
>>> contents are: {data}")
>>> return
>>>
>>> with beam.io.aws.s3io.S3IO(client=client).open(filename=filename, mode=
>>> "w") as s3_writer:
>>> with ParquetWriter(
>>> s3_writer, schema_without_field_metadata, compression="SNAPPY",
>>> use_deprecated_int96_timestamps=True
>>> ) as parquet_writer:
>>> merged_tables = pa.concat_tables(tables)
>>> parquet_writer.write_table(merged_tables)
>>>
>>> On Fri, Sep 20, 2024 at 12:02 AM Jan Lukavský  wrote:
>>>
 Hi Lydian,

 because you do not specify 'timestamp_policy' it should use the
 default, which should be processing time, so this should not be the issue.
 The one potentially left transform is the sink transform, as Reuven
 mentioned. Can you share details of the implementation?

  Jan
 On 9/19/24 23:10, Lydian Lee wrote:

 Hi, Jan

 Here's how we do ReadFromKafka, the expansion service is just to ensure
 we can work with xlang in k8s, so please ignore them.
 from apache_beam.io.kafka import default_io_expansion_service
 ReadFromKafka(
 consumer_config={
 "group.id": "group-name",
 "auto.offset.reset": "latest",
 "enable.auto.commit": "false",
 },
 topics="topic-name",
 with_metadata=False,
 expansion_service=default_io_expansion_service(
 append_args=[
 f"--defaultEnvironmentType=PROCESS",
 f'--defaultEnvironmentConfig={"command":"/opt/apache/beam/java_boot"}',
 "--experiments=use_deprecated_read",
 ]
 ),
 commit_offset_in_finalize=True,
 )

 Do you know what would be the right approach for using processing time
 instead? I thought the WindowInto supposed to use the timestamp we appened
 to the event?  Do you think it is still using the original Kafka event
 timestamp?  Thanks!



 On Thu, Sep 19, 2024 at 7:53 AM Jan Lukavský  wrote:

> Can you share the (relevant) parameters of the ReadFromKafka transform?
>
> This feels strange, and it might not do what you'd expect:
> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda
> event: window.TimestampedValue(event, time.time()))
>
> This does not change the assigned timestamp of an element, but creates
> a new element which contains processing time. It will not be used for
> windowing, though.
> On 9/19/24 00:49, Lydian Lee wrote:
>
> Hi Reuven,
>
> Here's a quick look for our pipeline:
> (
> pipeline
> | "Reading message from Kafka" >> ReadFromKafka(...)
> | "Deserializing events" >> Deserialize(**deserializer_args)
> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda
> event: window.TimestampedValue(eve

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-23 Thread Lydian Lee
Hi Jan,

Thanks for taking a quick look!  Yes, the "with" statement closes the writer
after the write.  In our use case, we actually don't mind if there are
duplicates of data written, but we are more concerned about the missing
data, which is the issue we are facing right now.

On Mon, Sep 23, 2024 at 11:54 AM Reuven Lax via user 
wrote:

> Do you close the write afterwards? If not, I wonder if you could lose
> records due to unflushed data.
>
> Also - what you're doing here can definitely lead to duplicate data
> written, since this DoFn can be run multiple times. The duplicates might
> also appear different if the Iterables are slightly different on retries,
> especially in the case when Flink restarts a checkpoint.
>
> Reuven
>
> On Mon, Sep 23, 2024 at 1:40 PM Lydian Lee 
> wrote:
>
>> Hi Jan,
>>
>> Thanks so much for your help. Here's our write to s3:
>>
>> from pyarrow.parquet import ParquetWriter
>>
>> class WriteBatchesToS3(beam.DoFn):
>> def __init__(
>> self,
>> output_path: str,
>> schema: pa.schema,
>> pipeline_options: PipelineOptions,
>> ) -> None:
>> self.output_path = output_path
>> self.schema = schema
>> self.pipeline_options = pipeline_options
>>
>> def process(self, data: Iterable[List[Dict]]) -> None:
>> """Write one batch per file to S3."""
>> client = beam.io.aws.clients.s3.boto3_client.Client(options=self.
>> pipeline_options)
>>
>> fields_without_metadata = [pa.field(f.name, f.type) for f in self.schema]
>> schema_without_field_metadata = pa.schema(fields_without_metadata)
>>
>> filename = os.path.join(
>> self.output_path,
>> f"uuid_{str(uuid4())}.parquet",
>> )
>>
>> tables = [pa.Table.from_pylist(batch, schema=
>> schema_without_field_metadata) for batch in data]
>> if len(tables) == 0:
>> logging.info(f"No data to write for key: {partition_date}, the grouped
>> contents are: {data}")
>> return
>>
>> with beam.io.aws.s3io.S3IO(client=client).open(filename=filename, mode=
>> "w") as s3_writer:
>> with ParquetWriter(
>> s3_writer, schema_without_field_metadata, compression="SNAPPY",
>> use_deprecated_int96_timestamps=True
>> ) as parquet_writer:
>> merged_tables = pa.concat_tables(tables)
>> parquet_writer.write_table(merged_tables)
>>
>> On Fri, Sep 20, 2024 at 12:02 AM Jan Lukavský  wrote:
>>
>>> Hi Lydian,
>>>
>>> because you do not specify 'timestamp_policy' it should use the default,
>>> which should be processing time, so this should not be the issue. The one
>>> potentially left transform is the sink transform, as Reuven mentioned. Can
>>> you share details of the implementation?
>>>
>>>  Jan
>>> On 9/19/24 23:10, Lydian Lee wrote:
>>>
>>> Hi, Jan
>>>
>>> Here's how we do ReadFromKafka, the expansion service is just to ensure
>>> we can work with xlang in k8s, so please ignore them.
>>> from apache_beam.io.kafka import default_io_expansion_service
>>> ReadFromKafka(
>>> consumer_config={
>>> "group.id": "group-name",
>>> "auto.offset.reset": "latest",
>>> "enable.auto.commit": "false",
>>> },
>>> topics="topic-name",
>>> with_metadata=False,
>>> expansion_service=default_io_expansion_service(
>>> append_args=[
>>> f"--defaultEnvironmentType=PROCESS",
>>> f'--defaultEnvironmentConfig={"command":"/opt/apache/beam/java_boot"}',
>>> "--experiments=use_deprecated_read",
>>> ]
>>> ),
>>> commit_offset_in_finalize=True,
>>> )
>>>
>>> Do you know what would be the right approach for using processing time
>>> instead? I thought the WindowInto supposed to use the timestamp we appened
>>> to the event?  Do you think it is still using the original Kafka event
>>> timestamp?  Thanks!
>>>
>>>
>>>
>>> On Thu, Sep 19, 2024 at 7:53 AM Jan Lukavský  wrote:
>>>
 Can you share the (relevant) parameters of the ReadFromKafka transform?

 This feels strange, and it might not do what you'd expect:
 | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
 window.TimestampedValue(event, time.time()))

 This does not change the assigned timestamp of an element, but creates
 a new element which contains processing time. It will not be used for
 windowing, though.
 On 9/19/24 00:49, Lydian Lee wrote:

 Hi Reuven,

 Here's a quick look for our pipeline:
 (
 pipeline
 | "Reading message from Kafka" >> ReadFromKafka(...)
 | "Deserializing events" >> Deserialize(**deserializer_args)
 | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
 window.TimestampedValue(event, time.time()))
 | "Window into Fixed Intervals" >> beam.WindowInto(
 beam.transforms.window.FixedWindows(fixed_window_size), #
 fixed_window_size is 1 min.
 allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness), #
 although we configured lateness, but because we are using processing time,
 i don't expect any late events
 )
 | "Adding random integer partition key" >> beam.Map(
 lambda event: (random.randint(1, 5), element) # add dummy key to
 reshuffle to less partitions.  Kafka have 

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-23 Thread Reuven Lax via user
Do you close the write afterwards? If not, I wonder if you could lose
records due to unflushed data.

Also - what you're doing here can definitely lead to duplicate data
written, since this DoFn can be run multiple times. The duplicates might
also appear different if the Iterables are slightly different on retries,
especially in the case when Flink restarts a checkpoint.
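
(Not something this thread prescribes, but one common way to keep such retries
from producing differently named files is to derive the file name
deterministically from the window and the shuffle key instead of uuid4(). A
rough sketch, assuming the DoFn is applied before the "Abandon Dummy Key" step
so it still sees the (key, batches) pairs; the class name is hypothetical:)

import os

import apache_beam as beam


class WriteWindowedBatches(beam.DoFn):
    """Sketch only: deterministic names so a retried bundle targets the same S3 object."""

    def __init__(self, output_path: str):
        self.output_path = output_path

    def process(self, keyed_batches, window=beam.DoFn.WindowParam):
        # keyed_batches is assumed to be (shard_key, batches), i.e. the raw
        # GroupByKey output rather than the values alone.
        shard_key, batches = keyed_batches
        # window.start is stable across retries, unlike uuid4(), so re-running the
        # same bundle overwrites one object instead of adding a new, different one.
        filename = os.path.join(
            self.output_path,
            f"window_{window.start.to_utc_datetime():%Y%m%d%H%M%S}_shard_{shard_key}.parquet",
        )
        # ... open `filename` via S3IO and write the merged table, as in the original DoFn ...
        yield filename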

Reuven

On Mon, Sep 23, 2024 at 1:40 PM Lydian Lee  wrote:

> Hi Jan,
>
> Thanks so much for your help. Here's our write to s3:
>
> from pyarrow.parquet import ParquetWriter
>
> class WriteBatchesToS3(beam.DoFn):
> def __init__(
> self,
> output_path: str,
> schema: pa.schema,
> pipeline_options: PipelineOptions,
> ) -> None:
> self.output_path = output_path
> self.schema = schema
> self.pipeline_options = pipeline_options
>
> def process(self, data: Iterable[List[Dict]]) -> None:
> """Write one batch per file to S3."""
> client = beam.io.aws.clients.s3.boto3_client.Client(options=self.
> pipeline_options)
>
> fields_without_metadata = [pa.field(f.name, f.type) for f in self.schema]
> schema_without_field_metadata = pa.schema(fields_without_metadata)
>
> filename = os.path.join(
> self.output_path,
> f"uuid_{str(uuid4())}.parquet",
> )
>
> tables = [pa.Table.from_pylist(batch, schema=schema_without_field_metadata)
> for batch in data]
> if len(tables) == 0:
> logging.info(f"No data to write for key: {partition_date}, the grouped
> contents are: {data}")
> return
>
> with beam.io.aws.s3io.S3IO(client=client).open(filename=filename, mode="w")
> as s3_writer:
> with ParquetWriter(
> s3_writer, schema_without_field_metadata, compression="SNAPPY",
> use_deprecated_int96_timestamps=True
> ) as parquet_writer:
> merged_tables = pa.concat_tables(tables)
> parquet_writer.write_table(merged_tables)
>
> On Fri, Sep 20, 2024 at 12:02 AM Jan Lukavský  wrote:
>
>> Hi Lydian,
>>
>> because you do not specify 'timestamp_policy' it should use the default,
>> which should be processing time, so this should not be the issue. The one
>> potentially left transform is the sink transform, as Reuven mentioned. Can
>> you share details of the implementation?
>>
>>  Jan
>> On 9/19/24 23:10, Lydian Lee wrote:
>>
>> Hi, Jan
>>
>> Here's how we do ReadFromKafka, the expansion service is just to ensure
>> we can work with xlang in k8s, so please ignore them.
>> from apache_beam.io.kafka import default_io_expansion_service
>> ReadFromKafka(
>> consumer_config={
>> "group.id": "group-name",
>> "auto.offset.reset": "latest",
>> "enable.auto.commit": "false",
>> },
>> topics="topic-name",
>> with_metadata=False,
>> expansion_service=default_io_expansion_service(
>> append_args=[
>> f"--defaultEnvironmentType=PROCESS",
>> f'--defaultEnvironmentConfig={"command":"/opt/apache/beam/java_boot"}',
>> "--experiments=use_deprecated_read",
>> ]
>> ),
>> commit_offset_in_finalize=True,
>> )
>>
>> Do you know what would be the right approach for using processing time
>> instead? I thought the WindowInto supposed to use the timestamp we appened
>> to the event?  Do you think it is still using the original Kafka event
>> timestamp?  Thanks!
>>
>>
>>
>> On Thu, Sep 19, 2024 at 7:53 AM Jan Lukavský  wrote:
>>
>>> Can you share the (relevant) parameters of the ReadFromKafka transform?
>>>
>>> This feels strange, and it might not do what you'd expect:
>>> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
>>> window.TimestampedValue(event, time.time()))
>>>
>>> This does not change the assigned timestamp of an element, but creates a
>>> new element which contains processing time. It will not be used for
>>> windowing, though.
>>> On 9/19/24 00:49, Lydian Lee wrote:
>>>
>>> Hi Reuven,
>>>
>>> Here's a quick look for our pipeline:
>>> (
>>> pipeline
>>> | "Reading message from Kafka" >> ReadFromKafka(...)
>>> | "Deserializing events" >> Deserialize(**deserializer_args)
>>> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
>>> window.TimestampedValue(event, time.time()))
>>> | "Window into Fixed Intervals" >> beam.WindowInto(
>>> beam.transforms.window.FixedWindows(fixed_window_size), #
>>> fixed_window_size is 1 min.
>>> allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness), #
>>> although we configured lateness, but because we are using processing time,
>>> i don't expect any late events
>>> )
>>> | "Adding random integer partition key" >> beam.Map(
>>> lambda event: (random.randint(1, 5), element) # add dummy key to
>>> reshuffle to less partitions.  Kafka have 16 partition, but we only want to
>>> generate 2 files every minute
>>> )
>>> | "Group by randomly-assigned integer key" >> beam.GroupByKey()
>>> | "Abandon Dummy Key" >> beam.MapTuple(lambda key, val: val)
>>> | "Writing event data batches to parquet" >> WriteBatchesToS3(...)  #
>>> call boto3 to write the events into S3 with parquet format
>>> )
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Sep 18, 2024 at 3:16 PM Reuven Lax via user <
>>> user@beam.apache.org> wrote:
>>>
 How are

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-23 Thread Lydian Lee
Hi Jan,

Thanks so much for your help. Here's our write to s3:

from pyarrow.parquet import ParquetWriter

class WriteBatchesToS3(beam.DoFn):
    def __init__(
        self,
        output_path: str,
        schema: pa.schema,
        pipeline_options: PipelineOptions,
    ) -> None:
        self.output_path = output_path
        self.schema = schema
        self.pipeline_options = pipeline_options

    def process(self, data: Iterable[List[Dict]]) -> None:
        """Write one batch per file to S3."""
        client = beam.io.aws.clients.s3.boto3_client.Client(options=self.pipeline_options)

        fields_without_metadata = [pa.field(f.name, f.type) for f in self.schema]
        schema_without_field_metadata = pa.schema(fields_without_metadata)

        filename = os.path.join(
            self.output_path,
            f"uuid_{str(uuid4())}.parquet",
        )

        tables = [pa.Table.from_pylist(batch, schema=schema_without_field_metadata) for batch in data]
        if len(tables) == 0:
            logging.info(f"No data to write for key: {partition_date}, the grouped contents are: {data}")
            return

        with beam.io.aws.s3io.S3IO(client=client).open(filename=filename, mode="w") as s3_writer:
            with ParquetWriter(
                s3_writer, schema_without_field_metadata, compression="SNAPPY",
                use_deprecated_int96_timestamps=True
            ) as parquet_writer:
                merged_tables = pa.concat_tables(tables)
                parquet_writer.write_table(merged_tables)

On Fri, Sep 20, 2024 at 12:02 AM Jan Lukavský  wrote:

> Hi Lydian,
>
> because you do not specify 'timestamp_policy' it should use the default,
> which should be processing time, so this should not be the issue. The one
> potentially left transform is the sink transform, as Reuven mentioned. Can
> you share details of the implementation?
>
>  Jan
> On 9/19/24 23:10, Lydian Lee wrote:
>
> Hi, Jan
>
> Here's how we do ReadFromKafka, the expansion service is just to ensure we
> can work with xlang in k8s, so please ignore them.
> from apache_beam.io.kafka import default_io_expansion_service
> ReadFromKafka(
> consumer_config={
> "group.id": "group-name",
> "auto.offset.reset": "latest",
> "enable.auto.commit": "false",
> },
> topics="topic-name",
> with_metadata=False,
> expansion_service=default_io_expansion_service(
> append_args=[
> f"--defaultEnvironmentType=PROCESS",
> f'--defaultEnvironmentConfig={"command":"/opt/apache/beam/java_boot"}',
> "--experiments=use_deprecated_read",
> ]
> ),
> commit_offset_in_finalize=True,
> )
>
> Do you know what would be the right approach for using processing time
> instead? I thought the WindowInto supposed to use the timestamp we appened
> to the event?  Do you think it is still using the original Kafka event
> timestamp?  Thanks!
>
>
>
> On Thu, Sep 19, 2024 at 7:53 AM Jan Lukavský  wrote:
>
>> Can you share the (relevant) parameters of the ReadFromKafka transform?
>>
>> This feels strange, and it might not do what you'd expect:
>> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
>> window.TimestampedValue(event, time.time()))
>>
>> This does not change the assigned timestamp of an element, but creates a
>> new element which contains processing time. It will not be used for
>> windowing, though.
>> On 9/19/24 00:49, Lydian Lee wrote:
>>
>> Hi Reuven,
>>
>> Here's a quick look for our pipeline:
>> (
>> pipeline
>> | "Reading message from Kafka" >> ReadFromKafka(...)
>> | "Deserializing events" >> Deserialize(**deserializer_args)
>> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
>> window.TimestampedValue(event, time.time()))
>> | "Window into Fixed Intervals" >> beam.WindowInto(
>> beam.transforms.window.FixedWindows(fixed_window_size), #
>> fixed_window_size is 1 min.
>> allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness), #
>> although we configured lateness, but because we are using processing time,
>> i don't expect any late events
>> )
>> | "Adding random integer partition key" >> beam.Map(
>> lambda event: (random.randint(1, 5), element) # add dummy key to
>> reshuffle to less partitions.  Kafka have 16 partition, but we only want to
>> generate 2 files every minute
>> )
>> | "Group by randomly-assigned integer key" >> beam.GroupByKey()
>> | "Abandon Dummy Key" >> beam.MapTuple(lambda key, val: val)
>> | "Writing event data batches to parquet" >> WriteBatchesToS3(...)  #
>> call boto3 to write the events into S3 with parquet format
>> )
>>
>> Thanks!
>>
>>
>> On Wed, Sep 18, 2024 at 3:16 PM Reuven Lax via user 
>> wrote:
>>
>>> How are you doing this aggregation?
>>>
>>> On Wed, Sep 18, 2024 at 3:11 PM Lydian Lee 
>>> wrote:
>>>
 Hi Jan,

 Thanks for the recommendation. In our case, we are windowing with the
 processing time, which means that there should be no late event at all.

 You’ve mentioned that GroupByKey is stateful and can potentially drop
 the data. Given that after reshuffle (add random shuffle id to the key), we
 then do the aggregation (combine the data and write those data to S3.) Do
 you think the example I mentioned earlier could potentially be the reason
 for the dropping data?

 If so, i

Re: Support for Flink 1.19 or 1.20

2024-09-20 Thread XQ Hu via user
I suggest you open a feature request issue like
https://github.com/apache/beam/issues/30789 to track this request.

On Thu, Sep 19, 2024 at 1:20 PM Alek Mitrevski 
wrote:

> Is there a roadmap or support planned for Flink 1.19 or 1.20 releases?
>


Re: Beam dropping events from Kafka after reshuffle ?

2024-09-20 Thread Jan Lukavský

Hi Lydian,

because you do not specify 'timestamp_policy' it should use the default, 
which should be processing time, so this should not be the issue. The 
one potentially left transform is the sink transform, as Reuven 
mentioned. Can you share details of the implementation?


 Jan

On 9/19/24 23:10, Lydian Lee wrote:

Hi, Jan

Here's how we do ReadFromKafka, the expansion service is just to 
ensure we can work with xlang in k8s, so please ignore them.

from apache_beam.io.kafka import default_io_expansion_service
ReadFromKafka(
consumer_config={
"group.id ": "group-name",
"auto.offset.reset": "latest",
"enable.auto.commit": "false",
},
topics="topic-name",
with_metadata=False,
expansion_service=default_io_expansion_service(
append_args=[
f"--defaultEnvironmentType=PROCESS",
f'--defaultEnvironmentConfig={"command":"/opt/apache/beam/java_boot"}',
"--experiments=use_deprecated_read",
]
),
commit_offset_in_finalize=True,
)

Do you know what would be the right approach for using processing time 
instead? I thought the WindowInto supposed to use the timestamp we 
appened to the event?  Do you think it is still using the original 
Kafka event timestamp?  Thanks!




On Thu, Sep 19, 2024 at 7:53 AM Jan Lukavský  wrote:

Can you share the (relevant) parameters of the ReadFromKafka
transform?

This feels strange, and it might not do what you'd expect:

| "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda
event: window.TimestampedValue(event, time.time()))

This does not change the assigned timestamp of an element, but
creates a new element which contains processing time. It will not
be used for windowing, though.
On 9/19/24 00:49, Lydian Lee wrote:

Hi Reuven,

Here's a quick look for our pipeline:
(
pipeline
|"Reading message from Kafka">>ReadFromKafka(...)
| "Deserializing events" >> Deserialize(**deserializer_args)
| "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda
event: window.TimestampedValue(event, time.time()))
| "Window into Fixed Intervals" >> beam.WindowInto(
beam.transforms.window.FixedWindows(fixed_window_size), #
fixed_window_size is 1 min.
allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness),
# although we configured lateness, but because we are using
processing time, i don't expect any late events
)
| "Adding random integer partition key" >> beam.Map(
lambda event: (random.randint(1, 5), element) # add dummy key to
reshuffle to less partitions.  Kafka have 16 partition, but we
only want to generate 2 files every minute
)
| "Group by randomly-assigned integer key" >> beam.GroupByKey()
| "Abandon Dummy Key" >> beam.MapTuple(lambda key, val: val)
| "Writing event data batches to parquet"
>> WriteBatchesToS3(...) # call boto3 to write the events into S3
with parquet format
)

Thanks!


On Wed, Sep 18, 2024 at 3:16 PM Reuven Lax via user
 wrote:

How are you doing this aggregation?

On Wed, Sep 18, 2024 at 3:11 PM Lydian Lee
 wrote:

Hi Jan,

Thanks for the recommendation. In our case, we are
windowing with the processing time, which means that
there should be no late event at all.

You’ve mentioned that GroupByKey is stateful and can
potentially drop the data. Given that after reshuffle
(add random shuffle id to the key), we then do the
aggregation (combine the data and write those data to
S3.) Do you think the example I mentioned earlier could
potentially be the reason for the dropping data?

If so, in general how does Beam being able to prevent
that ? Are there any suggested approaches? Thanks

On Wed, Sep 18, 2024 at 12:33 AM Jan Lukavský
 wrote:

Hi Lydian,

in that case, there is only a generic advice you can
look into. Reshuffle is a stateless operation that
should not cause dropping data. A GroupByKey on the
other hand is stateful and thus can - when dealing
with late data - drop some of them. You should be
able to confirm this looking for
'droppedDueToLateness' counter and/or log in here
[1]. This happens when elements arrive after
watermark passes element's timestamp minus allowed
lateness. If you see the log, you might need to
either change how you assign timestamps to elements
(e.g. use log append time) or increase allowed
lateness of your windowfn.

Best,

 Jan

[1]

https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDr

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-19 Thread Lydian Lee
Hi, Jan

Here's how we do ReadFromKafka; the expansion service is just to ensure we
can work with xlang in k8s, so please ignore it.
from apache_beam.io.kafka import default_io_expansion_service

ReadFromKafka(
    consumer_config={
        "group.id": "group-name",
        "auto.offset.reset": "latest",
        "enable.auto.commit": "false",
    },
    topics="topic-name",
    with_metadata=False,
    expansion_service=default_io_expansion_service(
        append_args=[
            f"--defaultEnvironmentType=PROCESS",
            f'--defaultEnvironmentConfig={"command":"/opt/apache/beam/java_boot"}',
            "--experiments=use_deprecated_read",
        ]
    ),
    commit_offset_in_finalize=True,
)


Do you know what would be the right approach for using processing time
instead? I thought WindowInto was supposed to use the timestamp we appended
to the event. Do you think it is still using the original Kafka event
timestamp?  Thanks!
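
(For reference, the cross-language ReadFromKafka also takes a timestamp_policy
argument, so the policy can be pinned explicitly rather than relying on the
default; a sketch with most arguments elided, using the constants defined in
apache_beam.io.kafka:)

from apache_beam.io.kafka import ReadFromKafka

read = ReadFromKafka(
    consumer_config={"group.id": "group-name", "auto.offset.reset": "latest"},
    topics=["topic-name"],
    # Explicitly request processing-time timestamps (also the documented default);
    # ReadFromKafka.log_append_time would use the broker append time instead.
    timestamp_policy=ReadFromKafka.processing_time_policy,
    commit_offset_in_finalize=True,
)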



On Thu, Sep 19, 2024 at 7:53 AM Jan Lukavský  wrote:

> Can you share the (relevant) parameters of the ReadFromKafka transform?
>
> This feels strange, and it might not do what you'd expect:
> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
> window.TimestampedValue(event, time.time()))
>
> This does not change the assigned timestamp of an element, but creates a
> new element which contains processing time. It will not be used for
> windowing, though.
>
> On 9/19/24 00:49, Lydian Lee wrote:
>
> Hi Reuven,
>
> Here's a quick look for our pipeline:
> (
> pipeline
> | "Reading message from Kafka" >> ReadFromKafka(...)
> | "Deserializing events" >> Deserialize(**deserializer_args)
> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
> window.TimestampedValue(event, time.time()))
> | "Window into Fixed Intervals" >> beam.WindowInto(
> beam.transforms.window.FixedWindows(fixed_window_size), #
> fixed_window_size is 1 min.
> allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness), #
> although we configured lateness, but because we are using processing time,
> i don't expect any late events
> )
> | "Adding random integer partition key" >> beam.Map(
> lambda event: (random.randint(1, 5), element) # add dummy key to
> reshuffle to less partitions.  Kafka have 16 partition, but we only want to
> generate 2 files every minute
> )
> | "Group by randomly-assigned integer key" >> beam.GroupByKey()
> | "Abandon Dummy Key" >> beam.MapTuple(lambda key, val: val)
> | "Writing event data batches to parquet" >> WriteBatchesToS3(...)  #
> call boto3 to write the events into S3 with parquet format
> )
>
> Thanks!
>
>
> On Wed, Sep 18, 2024 at 3:16 PM Reuven Lax via user 
> wrote:
>
>> How are you doing this aggregation?
>>
>> On Wed, Sep 18, 2024 at 3:11 PM Lydian Lee 
>> wrote:
>>
>>> Hi Jan,
>>>
>>> Thanks for the recommendation. In our case, we are windowing with the
>>> processing time, which means that there should be no late event at all.
>>>
>>> You’ve mentioned that GroupByKey is stateful and can potentially drop
>>> the data. Given that after reshuffle (add random shuffle id to the key), we
>>> then do the aggregation (combine the data and write those data to S3.) Do
>>> you think the example I mentioned earlier could potentially be the reason
>>> for the dropping data?
>>>
>>> If so, in general how does Beam being able to prevent that ? Are there
>>> any suggested approaches? Thanks
>>>
>>> On Wed, Sep 18, 2024 at 12:33 AM Jan Lukavský  wrote:
>>>
 Hi Lydian,

 in that case, there is only a generic advice you can look into.
 Reshuffle is a stateless operation that should not cause dropping data. A
 GroupByKey on the other hand is stateful and thus can - when dealing with
 late data - drop some of them. You should be able to confirm this looking
 for 'droppedDueToLateness' counter and/or log in here [1]. This happens
 when elements arrive after watermark passes element's timestamp minus
 allowed lateness. If you see the log, you might need to either change how
 you assign timestamps to elements (e.g. use log append time) or increase
 allowed lateness of your windowfn.

 Best,

  Jan

 [1]
 https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L132
 On 9/18/24 08:53, Lydian Lee wrote:

 I would love to, but there are some limitations on our ends that the
 version bump won’t be happened soon. Thus I need to figure out what might
 be the root cause though.


 On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský  wrote:

> Hi Lydian,
>
> 2.41.0 is quite old, can you please try current version to see if this
> issue is still present? There were lots of changes between 2.41.0 and
> 2.59.0.
>
>  Jan
> On 9/17/24 17:49, Lydian Lee wrote:
>
> Hi,
>
> We are using Beam Python SDK with Flink Runner, the Beam version is
> 2.41.0 and the Flink version is 1.15.4.
>
> We 

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-19 Thread Jan Lukavský

Can you share the (relevant) parameters of the ReadFromKafka transform?

This feels strange, and it might not do what you'd expect:

| "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event: 
window.TimestampedValue(event, time.time()))


This does not change the assigned timestamp of an element, but creates a 
new element which contains processing time. It will not be used for 
windowing, though.


On 9/19/24 00:49, Lydian Lee wrote:

Hi Reuven,

Here's a quick look for our pipeline:
(
pipeline
|"Reading message from Kafka">>ReadFromKafka(...)
| "Deserializing events" >> Deserialize(**deserializer_args)
| "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda 
event: window.TimestampedValue(event, time.time()))

| "Window into Fixed Intervals" >> beam.WindowInto(
beam.transforms.window.FixedWindows(fixed_window_size), # 
fixed_window_size is 1 min.
allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness), # 
although we configured lateness, but because we are using processing 
time, i don't expect any late events

)
| "Adding random integer partition key" >> beam.Map(
lambda event: (random.randint(1, 5), element) # add dummy key to 
reshuffle to less partitions.  Kafka have 16 partition, but we only 
want to generate 2 files every minute

)
| "Group by randomly-assigned integer key" >> beam.GroupByKey()
| "Abandon Dummy Key" >> beam.MapTuple(lambda key, val: val)
| "Writing event data batches to parquet" >> WriteBatchesToS3(...) # 
call boto3 to write the events into S3 with parquet format

)

Thanks!


On Wed, Sep 18, 2024 at 3:16 PM Reuven Lax via user 
 wrote:


How are you doing this aggregation?

On Wed, Sep 18, 2024 at 3:11 PM Lydian Lee
 wrote:

Hi Jan,

Thanks for the recommendation. In our case, we are windowing
with the processing time, which means that there should be no
late event at all.

You’ve mentioned that GroupByKey is stateful and can
potentially drop the data. Given that after reshuffle (add
random shuffle id to the key), we then do the aggregation
(combine the data and write those data to S3.) Do you think
the example I mentioned earlier could potentially be the
reason for the dropping data?

If so, in general how does Beam being able to prevent that ?
Are there any suggested approaches? Thanks

On Wed, Sep 18, 2024 at 12:33 AM Jan Lukavský
 wrote:

Hi Lydian,

in that case, there is only a generic advice you can look
into. Reshuffle is a stateless operation that should not
cause dropping data. A GroupByKey on the other hand is
stateful and thus can - when dealing with late data - drop
some of them. You should be able to confirm this looking
for 'droppedDueToLateness' counter and/or log in here [1].
This happens when elements arrive after watermark passes
element's timestamp minus allowed lateness. If you see the
log, you might need to either change how you assign
timestamps to elements (e.g. use log append time) or
increase allowed lateness of your windowfn.

Best,

 Jan

[1]

https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L132

On 9/18/24 08:53, Lydian Lee wrote:

I would love to, but there are some limitations on our
ends that the version bump won’t be happened soon. Thus I
need to figure out what might be the root cause though.


On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský
 wrote:

Hi Lydian,

2.41.0 is quite old, can you please try current
version to see if this issue is still present? There
were lots of changes between 2.41.0 and 2.59.0.

 Jan

On 9/17/24 17:49, Lydian Lee wrote:

Hi,

We are using Beam Python SDK with Flink Runner, the
Beam version is 2.41.0 and the Flink version is 1.15.4.

We have a pipeline that has 2 stages:
1. read from kafka and fixed window for every 1 minute
2. aggregate the data for the past 1 minute and
reshuffle so that we have less partition count and
write them into s3.

We disabled the enable.auto.commit and enabled
commit_offset_in_finalize. also the
auto.offset.reset is set to "latest"
image.png

According to the log, I can definitely find the data
is consuming from Kafka Offset, Because there are many
```
Resetting offset for topic -  to
offset

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-18 Thread Reuven Lax via user
If there are events arriving later than the allowed lateness, they might be
dropped. I know Dataflow has a metric that tracks records dropped due to
allowed lateness - I'm not sure if Flink does, but it would be worth checking.
You could also set allowed lateness to the largest possible value to test
whether that's what's causing issues here.

Another possibility - is WriteBatchesToS3 a transform you wrote? Sink
transforms can be tricky to write, and this is a potential place where data
might be lost.

On Wed, Sep 18, 2024 at 3:49 PM Lydian Lee  wrote:

> Hi Reuven,
>
> Here's a quick look for our pipeline:
> (
> pipeline
> | "Reading message from Kafka" >> ReadFromKafka(...)
> | "Deserializing events" >> Deserialize(**deserializer_args)
> | "Adding 'trigger_processing_time' timestamp" >> beam.Map(lambda event:
> window.TimestampedValue(event, time.time()))
> | "Window into Fixed Intervals" >> beam.WindowInto(
> beam.transforms.window.FixedWindows(fixed_window_size), #
> fixed_window_size is 1 min.
> allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness), #
> although we configured lateness, but because we are using processing time,
> i don't expect any late events
> )
> | "Adding random integer partition key" >> beam.Map(
> lambda event: (random.randint(1, 5), element) # add dummy key to
> reshuffle to less partitions.  Kafka have 16 partition, but we only want to
> generate 2 files every minute
> )
> | "Group by randomly-assigned integer key" >> beam.GroupByKey()
> | "Abandon Dummy Key" >> beam.MapTuple(lambda key, val: val)
> | "Writing event data batches to parquet" >> WriteBatchesToS3(...)  #
> call boto3 to write the events into S3 with parquet format
> )
>
> Thanks!
>
>
> On Wed, Sep 18, 2024 at 3:16 PM Reuven Lax via user 
> wrote:
>
>> How are you doing this aggregation?
>>
>> On Wed, Sep 18, 2024 at 3:11 PM Lydian Lee 
>> wrote:
>>
>>> Hi Jan,
>>>
>>> Thanks for the recommendation. In our case, we are windowing with the
>>> processing time, which means that there should be no late event at all.
>>>
>>> You’ve mentioned that GroupByKey is stateful and can potentially drop
>>> the data. Given that after reshuffle (add random shuffle id to the key), we
>>> then do the aggregation (combine the data and write those data to S3.) Do
>>> you think the example I mentioned earlier could potentially be the reason
>>> for the dropping data?
>>>
>>> If so, in general how does Beam being able to prevent that ? Are there
>>> any suggested approaches? Thanks
>>>
>>> On Wed, Sep 18, 2024 at 12:33 AM Jan Lukavský  wrote:
>>>
 Hi Lydian,

 in that case, there is only a generic advice you can look into.
 Reshuffle is a stateless operation that should not cause dropping data. A
 GroupByKey on the other hand is stateful and thus can - when dealing with
 late data - drop some of them. You should be able to confirm this looking
 for 'droppedDueToLateness' counter and/or log in here [1]. This happens
 when elements arrive after watermark passes element's timestamp minus
 allowed lateness. If you see the log, you might need to either change how
 you assign timestamps to elements (e.g. use log append time) or increase
 allowed lateness of your windowfn.

 Best,

  Jan

 [1]
 https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L132
 On 9/18/24 08:53, Lydian Lee wrote:

 I would love to, but there are some limitations on our ends that the
 version bump won’t be happened soon. Thus I need to figure out what might
 be the root cause though.


 On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský  wrote:

> Hi Lydian,
>
> 2.41.0 is quite old, can you please try current version to see if this
> issue is still present? There were lots of changes between 2.41.0 and
> 2.59.0.
>
>  Jan
> On 9/17/24 17:49, Lydian Lee wrote:
>
> Hi,
>
> We are using Beam Python SDK with Flink Runner, the Beam version is
> 2.41.0 and the Flink version is 1.15.4.
>
> We have a pipeline that has 2 stages:
> 1. read from kafka and fixed window for every 1 minute
> 2. aggregate the data for the past 1 minute and reshuffle so that we
> have less partition count and write them into s3.
>
> We disabled the enable.auto.commit and enabled
> commit_offset_in_finalize. also the auto.offset.reset is set to "latest"
> [image: image.png]
>
> According to the log, I can definitely find the data is consuming from
> Kafka Offset, Because there are many
> ```
> Resetting offset for topic -  to offset 
> ```
> and that partition/offset pair does match the missing records.
> However, it doesn't show up in the final S3.
>
> My current hypothesis is that the shuffling might be the reason for
> the issue, for example, originally in kafka

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-18 Thread Lydian Lee
Hi Reuven,

Here's a quick look at our pipeline:
(
    pipeline
    | "Reading message from Kafka" >> ReadFromKafka(...)
    | "Deserializing events" >> Deserialize(**deserializer_args)
    | "Adding 'trigger_processing_time' timestamp" >> beam.Map(
        lambda event: window.TimestampedValue(event, time.time()))
    | "Window into Fixed Intervals" >> beam.WindowInto(
        beam.transforms.window.FixedWindows(fixed_window_size),  # fixed_window_size is 1 min.
        allowed_lateness=beam.utils.timestamp.Duration(allowed_lateness),
        # although we configured lateness, but because we are using processing time,
        # i don't expect any late events
    )
    | "Adding random integer partition key" >> beam.Map(
        lambda event: (random.randint(1, 5), element)
        # add dummy key to reshuffle to less partitions.  Kafka have 16 partition,
        # but we only want to generate 2 files every minute
    )
    | "Group by randomly-assigned integer key" >> beam.GroupByKey()
    | "Abandon Dummy Key" >> beam.MapTuple(lambda key, val: val)
    | "Writing event data batches to parquet" >> WriteBatchesToS3(...)
    # call boto3 to write the events into S3 with parquet format
)

Thanks!


On Wed, Sep 18, 2024 at 3:16 PM Reuven Lax via user 
wrote:

> How are you doing this aggregation?
>
> On Wed, Sep 18, 2024 at 3:11 PM Lydian Lee 
> wrote:
>
>> Hi Jan,
>>
>> Thanks for the recommendation. In our case, we are windowing with the
>> processing time, which means that there should be no late event at all.
>>
>> You’ve mentioned that GroupByKey is stateful and can potentially drop the
>> data. Given that after reshuffle (add random shuffle id to the key), we
>> then do the aggregation (combine the data and write those data to S3.) Do
>> you think the example I mentioned earlier could potentially be the reason
>> for the dropping data?
>>
>> If so, in general how does Beam being able to prevent that ? Are there
>> any suggested approaches? Thanks
>>
>> On Wed, Sep 18, 2024 at 12:33 AM Jan Lukavský  wrote:
>>
>>> Hi Lydian,
>>>
>>> in that case, there is only a generic advice you can look into.
>>> Reshuffle is a stateless operation that should not cause dropping data. A
>>> GroupByKey on the other hand is stateful and thus can - when dealing with
>>> late data - drop some of them. You should be able to confirm this looking
>>> for 'droppedDueToLateness' counter and/or log in here [1]. This happens
>>> when elements arrive after watermark passes element's timestamp minus
>>> allowed lateness. If you see the log, you might need to either change how
>>> you assign timestamps to elements (e.g. use log append time) or increase
>>> allowed lateness of your windowfn.
>>>
>>> Best,
>>>
>>>  Jan
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L132
>>> On 9/18/24 08:53, Lydian Lee wrote:
>>>
>>> I would love to, but there are some limitations on our ends that the
>>> version bump won’t be happened soon. Thus I need to figure out what might
>>> be the root cause though.
>>>
>>>
>>> On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský  wrote:
>>>
 Hi Lydian,

 2.41.0 is quite old, can you please try current version to see if this
 issue is still present? There were lots of changes between 2.41.0 and
 2.59.0.

  Jan
 On 9/17/24 17:49, Lydian Lee wrote:

 Hi,

 We are using Beam Python SDK with Flink Runner, the Beam version is
 2.41.0 and the Flink version is 1.15.4.

 We have a pipeline that has 2 stages:
 1. read from kafka and fixed window for every 1 minute
 2. aggregate the data for the past 1 minute and reshuffle so that we
 have less partition count and write them into s3.

 We disabled the enable.auto.commit and enabled
 commit_offset_in_finalize. also the auto.offset.reset is set to "latest"
 [image: image.png]

 According to the log, I can definitely find the data is consuming from
 Kafka Offset, Because there are many
 ```
 Resetting offset for topic -  to offset 
 ```
 and that partition/offset pair does match the missing records.
 However, it doesn't show up in the final S3.

 My current hypothesis is that the shuffling might be the reason for the
 issue, for example, originally in kafka for the past minute in partition
 1,  I have offset 1, 2, 3 records. After reshuffle, it now distribute, for
 example:
 - partition A: 1, 3
 - partition B: 2

 And if partition A is done successfully but partition B fails. Given
 that A is succeeded, it will commit its offset to Kafka, and thus kafka now
 has an offset to 3.  And when kafka retries , it will skip the offset 2.
  However, I am not sure how exactly the offset commit works, wondering how
 it interacts with the checkpoints.  But it does seem like if my hypothesis
 is correct, we should be seeing more missing records, however, this seems
 rare to happen.  Wonderi

Re: Beam dropping events from Kafka after reshuffle ?

2024-09-18 Thread Reuven Lax via user
How are you doing this aggregation?

On Wed, Sep 18, 2024 at 3:11 PM Lydian Lee  wrote:

> Hi Jan,
>
> Thanks for the recommendation. In our case, we are windowing with the
> processing time, which means that there should be no late event at all.
>
> You’ve mentioned that GroupByKey is stateful and can potentially drop the
> data. Given that after reshuffle (add random shuffle id to the key), we
> then do the aggregation (combine the data and write those data to S3.) Do
> you think the example I mentioned earlier could potentially be the reason
> for the dropping data?
>
> If so, in general how does Beam being able to prevent that ? Are there any
> suggested approaches? Thanks
>
> On Wed, Sep 18, 2024 at 12:33 AM Jan Lukavský  wrote:
>
>> Hi Lydian,
>>
>> in that case, there is only a generic advice you can look into. Reshuffle
>> is a stateless operation that should not cause dropping data. A GroupByKey
>> on the other hand is stateful and thus can - when dealing with late data -
>> drop some of them. You should be able to confirm this looking for
>> 'droppedDueToLateness' counter and/or log in here [1]. This happens when
>> elements arrive after watermark passes element's timestamp minus allowed
>> lateness. If you see the log, you might need to either change how you
>> assign timestamps to elements (e.g. use log append time) or increase
>> allowed lateness of your windowfn.
>>
>> Best,
>>
>>  Jan
>>
>> [1]
>> https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L132
>> On 9/18/24 08:53, Lydian Lee wrote:
>>
>> I would love to, but there are some limitations on our ends that the
>> version bump won’t be happened soon. Thus I need to figure out what might
>> be the root cause though.
>>
>>
>> On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský  wrote:
>>
>>> Hi Lydian,
>>>
>>> 2.41.0 is quite old, can you please try current version to see if this
>>> issue is still present? There were lots of changes between 2.41.0 and
>>> 2.59.0.
>>>
>>>  Jan
>>> On 9/17/24 17:49, Lydian Lee wrote:
>>>
>>> Hi,
>>>
>>> We are using Beam Python SDK with Flink Runner, the Beam version is
>>> 2.41.0 and the Flink version is 1.15.4.
>>>
>>> We have a pipeline that has 2 stages:
>>> 1. read from kafka and fixed window for every 1 minute
>>> 2. aggregate the data for the past 1 minute and reshuffle so that we
>>> have less partition count and write them into s3.
>>>
>>> We disabled the enable.auto.commit and enabled
>>> commit_offset_in_finalize. also the auto.offset.reset is set to "latest"
>>> [image: image.png]
>>>
>>> According to the log, I can definitely find the data is consuming from
>>> Kafka Offset, Because there are many
>>> ```
>>> Resetting offset for topic -  to offset 
>>> ```
>>> and that partition/offset pair does match the missing records.  However,
>>> it doesn't show up in the final S3.
>>>
>>> My current hypothesis is that the shuffling might be the reason for the
>>> issue, for example, originally in kafka for the past minute in partition
>>> 1,  I have offset 1, 2, 3 records. After reshuffle, it now distribute, for
>>> example:
>>> - partition A: 1, 3
>>> - partition B: 2
>>>
>>> And if partition A is done successfully but partition B fails. Given
>>> that A is succeeded, it will commit its offset to Kafka, and thus kafka now
>>> has an offset to 3.  And when kafka retries , it will skip the offset 2.
>>>  However, I am not sure how exactly the offset commit works, wondering how
>>> it interacts with the checkpoints.  But it does seem like if my hypothesis
>>> is correct, we should be seeing more missing records, however, this seems
>>> rare to happen.  Wondering if anyone can help identify potential
>>> root causes?  Thanks
>>>
>>>
>>>
>>>
>>>


Re: Beam dropping events from Kafka after reshuffle ?

2024-09-18 Thread Lydian Lee
Hi Jan,

Thanks for the recommendation. In our case, we are windowing with
processing time, which means there should be no late events at all.

You've mentioned that GroupByKey is stateful and can potentially drop
data. Given that after the reshuffle (adding a random shuffle id to the key)
we then do the aggregation (combine the data and write it to S3), do
you think the example I mentioned earlier could be the reason
for the dropped data?

If so, how is Beam able to prevent that in general? Are there any
suggested approaches? Thanks

On Wed, Sep 18, 2024 at 12:33 AM Jan Lukavský  wrote:

> Hi Lydian,
>
> in that case, there is only a generic advice you can look into. Reshuffle
> is a stateless operation that should not cause dropping data. A GroupByKey
> on the other hand is stateful and thus can - when dealing with late data -
> drop some of them. You should be able to confirm this looking for
> 'droppedDueToLateness' counter and/or log in here [1]. This happens when
> elements arrive after watermark passes element's timestamp minus allowed
> lateness. If you see the log, you might need to either change how you
> assign timestamps to elements (e.g. use log append time) or increase
> allowed lateness of your windowfn.
>
> Best,
>
>  Jan
>
> [1]
> https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L132
> On 9/18/24 08:53, Lydian Lee wrote:
>
> I would love to, but there are some limitations on our ends that the
> version bump won’t be happened soon. Thus I need to figure out what might
> be the root cause though.
>
>
> On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský  wrote:
>
>> Hi Lydian,
>>
>> 2.41.0 is quite old, can you please try current version to see if this
>> issue is still present? There were lots of changes between 2.41.0 and
>> 2.59.0.
>>
>>  Jan
>> On 9/17/24 17:49, Lydian Lee wrote:
>>
>> Hi,
>>
>> We are using Beam Python SDK with Flink Runner, the Beam version is
>> 2.41.0 and the Flink version is 1.15.4.
>>
>> We have a pipeline that has 2 stages:
>> 1. read from kafka and fixed window for every 1 minute
>> 2. aggregate the data for the past 1 minute and reshuffle so that we have
>> less partition count and write them into s3.
>>
>> We disabled the enable.auto.commit and enabled commit_offset_in_finalize.
>> also the auto.offset.reset is set to "latest"
>> [image: image.png]
>>
>> According to the log, I can definitely find the data is consuming from
>> Kafka Offset, Because there are many
>> ```
>> Resetting offset for topic -  to offset 
>> ```
>> and that partition/offset pair does match the missing records.  However,
>> it doesn't show up in the final S3.
>>
>> My current hypothesis is that the shuffling might be the reason for the
>> issue, for example, originally in kafka for the past minute in partition
>> 1,  I have offset 1, 2, 3 records. After reshuffle, it now distribute, for
>> example:
>> - partition A: 1, 3
>> - partition B: 2
>>
>> And if partition A is done successfully but partition B fails. Given that
>> A is succeeded, it will commit its offset to Kafka, and thus kafka now has
>> an offset to 3.  And when kafka retries , it will skip the offset 2.
>>  However, I am not sure how exactly the offset commit works, wondering how
>> it interacts with the checkpoints.  But it does seem like if my hypothesis
>> is correct, we should be seeing more missing records, however, this seems
>> rare to happen.  Wondering if anyone can help identify potential
>> root causes?  Thanks
>>
>>
>>
>>
>>


Re: Beam dropping events from Kafka after reshuffle ?

2024-09-18 Thread Jan Lukavský

Hi Lydian,

in that case, there is only generic advice you can look into. 
Reshuffle is a stateless operation that should not cause dropping data. 
A GroupByKey on the other hand is stateful and thus can - when dealing 
with late data - drop some of them. You should be able to confirm this 
looking for 'droppedDueToLateness' counter and/or log in here [1]. This 
happens when elements arrive after watermark passes element's timestamp 
minus allowed lateness. If you see the log, you might need to either 
change how you assign timestamps to elements (e.g. use log append time) 
or increase allowed lateness of your windowfn.


Best,

 Jan

[1] 
https://github.com/apache/beam/blob/f37795e326a75310828518464189440b14863834/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L132


On 9/18/24 08:53, Lydian Lee wrote:
I would love to, but there are some limitations on our end, so the 
version bump won’t happen soon. In the meantime I need to figure out 
what the root cause might be.



On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský  wrote:

Hi Lydian,

2.41.0 is quite old, can you please try current version to see if
this issue is still present? There were lots of changes between
2.41.0 and 2.59.0.

 Jan

On 9/17/24 17:49, Lydian Lee wrote:

Hi,

We are using Beam Python SDK with Flink Runner, the Beam version
is 2.41.0 and the Flink version is 1.15.4.

We have a pipeline that has 2 stages:
1. read from kafka and fixed window for every 1 minute
2. aggregate the data for the past 1 minute and reshuffle so that
we have less partition count and write them into s3.

We disabled the enable.auto.commit and enabled
commit_offset_in_finalize. also the auto.offset.reset is set to
"latest"
image.png

According to the log, I can definitely see that the data is consumed
from Kafka, because there are many
```
Resetting offset for topic - to offset 
```
log lines, and that partition/offset pair does match the missing records.
However, the data doesn't show up in the final S3 output.

My current hypothesis is that the shuffling might be the reason
for the issue. For example, originally in Kafka for the past
minute in partition 1, I have records at offsets 1, 2, 3. After the
reshuffle, they are now distributed, for example:
- partition A: 1, 3
- partition B: 2

If partition A is processed successfully but partition B fails, then
because A succeeded it will commit its offset to Kafka, and Kafka now
has the offset at 3. When the read is retried, it will skip offset 2.
However, I am not sure how exactly the offset commit works and how it
interacts with the checkpoints. It also seems that if my hypothesis
were correct, we should be seeing more missing records, yet this
happens only rarely. Wondering if anyone can help identify potential
root causes?  Thanks





Re: Beam dropping events from Kafka after reshuffle ?

2024-09-17 Thread Lydian Lee
I would love to, but there are some limitations on our end, so the
version bump won’t happen soon. In the meantime I need to figure out what
the root cause might be.


On Tue, Sep 17, 2024 at 11:26 PM Jan Lukavský  wrote:

> Hi Lydian,
>
> 2.41.0 is quite old, can you please try current version to see if this
> issue is still present? There were lots of changes between 2.41.0 and
> 2.59.0.
>
>  Jan
> On 9/17/24 17:49, Lydian Lee wrote:
>
> Hi,
>
> We are using Beam Python SDK with Flink Runner, the Beam version is 2.41.0
> and the Flink version is 1.15.4.
>
> We have a pipeline that has 2 stages:
> 1. read from kafka and fixed window for every 1 minute
> 2. aggregate the data for the past 1 minute and reshuffle so that we have
> less partition count and write them into s3.
>
> We disabled the enable.auto.commit and enabled commit_offset_in_finalize.
> also the auto.offset.reset is set to "latest"
> [image: image.png]
>
> According to the log, I can definitely find the data is consuming from
> Kafka Offset, Because there are many
> ```
> Resetting offset for topic -  to offset 
> ```
> and that partition/offset pair does match the missing records.  However,
> it doesn't show up in the final S3.
>
> My current hypothesis is that the shuffling might be the reason for the
> issue, for example, originally in kafka for the past minute in partition
> 1,  I have offset 1, 2, 3 records. After reshuffle, it now distribute, for
> example:
> - partition A: 1, 3
> - partition B: 2
>
> And if partition A is done successfully but partition B fails. Given that
> A is succeeded, it will commit its offset to Kafka, and thus kafka now has
> an offset to 3.  And when kafka retries , it will skip the offset 2.
>  However, I am not sure how exactly the offset commit works, wondering how
> it interacts with the checkpoints.  But it does seem like if my hypothesis
> is correct, we should be seeing more missing records, however, this seems
> rare to happen.  Wondering if anyone can help identify potential
> root causes?  Thanks
>
>
>
>
>


Re: Beam dropping events from Kafka after reshuffle ?

2024-09-17 Thread Jan Lukavský

Hi Lydian,

2.41.0 is quite old, can you please try current version to see if this 
issue is still present? There were lots of changes between 2.41.0 and 
2.59.0.


 Jan

On 9/17/24 17:49, Lydian Lee wrote:

Hi,

We are using Beam Python SDK with Flink Runner, the Beam version is 
2.41.0 and the Flink version is 1.15.4.


We have a pipeline that has 2 stages:
1. read from kafka and fixed window for every 1 minute
2. aggregate the data for the past 1 minute and reshuffle so that we 
have less partition count and write them into s3.


We disabled the enable.auto.commit and enabled 
commit_offset_in_finalize. also the auto.offset.reset is set to "latest"

image.png

According to the log, I can definitely see that the data is consumed from 
Kafka, because there are many

```
Resetting offset for topic -  to offset 
```
log lines, and that partition/offset pair does match the missing records.  
However, the data doesn't show up in the final S3 output.


My current hypothesis is that the shuffling might be the reason for 
the issue. For example, originally in Kafka for the past minute in 
partition 1, I have records at offsets 1, 2, 3. After the reshuffle, they 
are now distributed, for example:

- partition A: 1, 3
- partition B: 2

If partition A is processed successfully but partition B fails, then 
because A succeeded it will commit its offset to Kafka, and thus 
Kafka now has the offset at 3. When the read is retried, it will skip 
offset 2. However, I am not sure how exactly the offset commit 
works and how it interacts with the checkpoints. It also seems that 
if my hypothesis were correct, we should be seeing more missing 
records, yet this happens only rarely. Wondering if anyone can help 
identify potential root causes?  Thanks






Re:

2024-09-16 Thread Ahmet Altay via user
Hi Ahijah -- Email did not have your question. Do you have a question?

(moving dev list to bcc.)

On Mon, Sep 16, 2024 at 12:57 PM Ahijah Koil Boaz Isacejayakumar via dev <
d...@beam.apache.org> wrote:

>
>
> This message contains proprietary information from Equifax which may be
> confidential. If you are not an intended recipient, please refrain from any
> disclosure, copying, distribution or use of this information and note that
> such actions are prohibited. If you have received this transmission in
> error, please notify by e-mail postmas...@equifax.com.
>
> Equifax® is a registered trademark of Equifax Inc.  All rights reserved.
>


Re: Performance issue BigQueryIO DynamicDestinations.

2024-08-30 Thread Ahmed Abualsaud via user
Hey Siyuan,

Can you open a GitHub issue with a reproducible code snippet? We can try
and take a look but will need more info (for example, which write method
are you using).
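
For reference, a reproducible snippet could be as small as the following
untested sketch (this is the Python-SDK flavour of dynamic destinations,
i.e. a callable `table`; the project/dataset/table names and schema are
made up):

import apache_beam as beam


def destination(row):
    # Route each element to its own table; all tables share one schema.
    return f"{row['project']}:{row['dataset']}.{row['table']}"


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([
            {"project": "proj-a", "dataset": "ds", "table": "t1", "value": 1},
            {"project": "proj-b", "dataset": "ds", "table": "t2", "value": 2},
        ])
        | beam.io.WriteToBigQuery(
            table=destination,
            schema="project:STRING,dataset:STRING,table:STRING,value:INTEGER",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        )
    )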

Best,
Ahmed

On Thu, Aug 29, 2024 at 6:21 PM hsy...@gmail.com  wrote:

> I have a question regarding the performance of BigQueryIO
> DynamicDestinations.
> In my use case, I need to write the data to hundreds of different tables in
> different projects (same region). The data volume for each table is
> quite different, which means the data is highly skewed. I observe that the
> write performance is very bad (about 10x slower) compared to writing all the
> data into just one table. All the tables have the same schema.
>
> Any recommended solutions? Thanks!
>
> Regards,
> Siyuan
>


Re: [Question] Write a PCollection of elements periodically to a file in Flink on Beam application

2024-08-29 Thread Wiśniowski Piotr

Vamsi,

Actually, I do not think it is possible to avoid shuffling, simply because 
you want to put data into 5-minute fixed windows and there is no 
ordering on the incoming input data.
I am not sure what your upstream source or transformation that 
creates the PCollection you want to save looks like, but there is no 
assumption that the elements are correlated in time. The only guarantee 
is that no element will have a timestamp earlier than the watermark.


So in your problem source pcollection can get messages with this 
timestamps at same moment into different workers:

-e1, 2024-01-01T00:00:10

-e2, 2024-01-01T00:03:10

-e3, 2024-01-01T00:08:10

-e4, 2024-01-01T00:04:10

-e5, 2024-01-01T00:06:10

-e6, 2024-01-01T00:09:10

Since this is distributed computation there is no possibility to order 
the events globally. Also there is no guarantee that all workers will be 
processing all time window buckets.


What can be done is that the worker that receives the message checks if 
it is responsible for computing this message window and stores it on a 
disc or forwards the message to the worker that is actually responsible 
for the message window (which is really a shuffle operation).


What could be done to optimize the sharding-prefix code further is to try 
`GroupIntoBatches.WithShardedKey` (impl ref 
https://github.com/apache/beam/blob/bcee5d081d05841bb52f0851f94a04f9c8968b88/sdks/python/apache_beam/transforms/util.py#L1113), 
as some runners might optimize this shuffle so that a minimal amount of 
data is actually sent through the network.
You may try to use that transform (if it meets your requirements) or 
take a look there at how `ShardedKey` is used; a rough sketch follows below.
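A rough, untested sketch of what that could look like in the Python SDK
follows; the batch size, window size and the final write step are
placeholders, and the exact shape of the sharded-key output may differ
between SDK versions:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows


class PrintBatch(beam.DoFn):
    # Stand-in for the real "write one file per batch" DoFn.
    def process(self, element):
        sharded_key, batch = element
        print(f"key={sharded_key} batch_size={len(list(batch))}")


with beam.Pipeline() as p:
    _ = (
        p
        | "Create" >> beam.Create(range(100))
        | "Key" >> beam.Map(lambda x: ("all", x))  # cheap constant key
        | "Window" >> beam.WindowInto(FixedWindows(300))  # 5-minute windows
        | "Batch" >> beam.GroupIntoBatches.WithShardedKey(
            batch_size=25, max_buffering_duration_secs=300)
        | "Write" >> beam.ParDo(PrintBatch())
    )

The runner is then free to pick the sharding, which is what limits how much
data actually has to move over the network.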


On the idea of a custom executor thread - this is asking for trouble. 
Typically Beam (and all other distributed computation engines) is 
designed to handle parallelization for you, and if you try to do 
something with threads on a worker yourself, things typically end badly 
(random failures, missing data, hard debugging). I really doubt it would 
be more performant than an actual shuffle done the Beam way.


Best,

Wisniowski Piotr


On 27.08.2024 20:35, vamsikrishna korada wrote:

Thanks for the reply Wisniowski Piotr.

As you mentioned, the data can be big and I am okay with writing 
multiple files every 5 minutes..
If I assign a key that uniquely represents a worker, that makes sure 
all the data that is processed together stays together. But there can 
still be a shuffle and the data can collectively be moved across nodes 
when we are assigning the keys right? Is there any way I can avoid the 
extra shuffle?


Do you think it is a good idea to instead have a separate scheduler 
thread within the DoFn which will periodically flush the data every 5 
minutes from each worker?

Thanks,
Vamsi

On Tue, 27 Aug 2024 at 17:49, Wiśniowski Piotr 
 wrote:


Hi,

If you require only a single file per 5 min then what you are
missing is that you need to window data into fixed windows, so
that state-full DoFn could store the elements per key per window.

Something like (did not test this code just pseudo code):

class StatefulWriteToFileDoFn(beam.DoFn):
    BAG_STATE = BagStateSpec('bag_state', VarIntCoder())
    TIMER = TimerSpec('timer', TimeDomain.WATERMARK)

    def process(self, element, timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam,
                bag_state=beam.DoFn.StateParam(BAG_STATE),
                timer=beam.DoFn.TimerParam(TIMER)):
        bag_state.add(element)
        timer.set(window.end)

    def on_timer(self, window=beam.DoFn.WindowParam,
                 bag_state=beam.DoFn.StateParam(BAG_STATE)):
        # Here, you can generate a filename based on the window's end time, for example
        filename = f'output_{window.end.to_utc_datetime().strftime("%Y%m%d%H%M%S")}.txt'
        with open(filename, 'w') as f:
            for element in bag_state.read():
                f.write(f'{element}\n')


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'ReadStream' >> beam.io.ReadFromPubSub(topic='projects/.../topics/...')
         | 'WindowIntoFixed' >> beam.WindowInto(FixedWindows(300))  # 5-minute windows
         | 'MapToSingleKey' >> beam.Map(lambda x: (1, x))  # Map to a single key
         | 'StatefulWriteToFile' >> beam.ParDo(StatefulWriteToFileDoFn())
         )

So the stateful DoFn will keep its state per key per window, so
each window should create its own output file. But note that this
requires you to put all the 5 min data on the same worker, as only a
single worker can be responsible for creating the file, and hence
the shuffling is required.

There is a workaround If the traffic might be too big, but it
would mean to generate more files per 5 min window (one file per
worker).

The trick is to assign key that uniquely represents the worker,
not the data. So every worker that maps the key should have his
unique value put in this key.
See this example:

class _Key(DoFn):
    def setup(self):
        self.shard_prefix = str(uuid4())

    def process(

Re: [Question] Write a PCollection of elements periodically to a file in Flink on Beam application

2024-08-27 Thread vamsikrishna korada
Thanks for the reply Wisniowski Piotr.

As you mentioned, the data can be big and I am okay with writing multiple
files every 5 minutes..
If I assign a key that uniquely represents a worker, that makes sure all
the data that is processed together stays together. But there can still be
a shuffle and the data can collectively be moved across nodes when we are
assigning the keys right? Is there any way I can avoid the extra shuffle?

Do you think it is a good idea to instead have a separate scheduler thread
within the DoFn which will periodically flush the data every 5 minutes from
each worker?

Thanks,
Vamsi

On Tue, 27 Aug 2024 at 17:49, Wiśniowski Piotr <
contact.wisniowskipi...@gmail.com> wrote:

> Hi,
>
> If you require only a single file per 5 min then what you are missing is
> that you need to window data into fixed windows, so that state-full DoFn
> could store the elements per key per window.
>
> Something like (did not test this code just pseudo code):
> class StatefulWriteToFileDoFn(beam.DoFn):
>     BAG_STATE = BagStateSpec('bag_state', VarIntCoder())
>     TIMER = TimerSpec('timer', TimeDomain.WATERMARK)
>     def process(self, element, timestamp=beam.DoFn.TimestampParam,
>                 window=beam.DoFn.WindowParam,
>                 bag_state=beam.DoFn.StateParam(BAG_STATE),
>                 timer=beam.DoFn.TimerParam(TIMER)):
>         bag_state.add(element)
>         timer.set(window.end)
>     def on_timer(self, window=beam.DoFn.WindowParam,
>                  bag_state=beam.DoFn.StateParam(BAG_STATE)):
>         # Here, you can generate a filename based on the window's end time, for example
>         filename = f'output_{window.end.to_utc_datetime().strftime("%Y%m%d%H%M%S")}.txt'
>         with open(filename, 'w') as f:
>             for element in bag_state.read():
>                 f.write(f'{element}\n')
> def run():
>     with beam.Pipeline(options=PipelineOptions()) as p:
>         (p
>          | 'ReadStream' >> beam.io.ReadFromPubSub(topic='projects/.../topics/...')
>          | 'WindowIntoFixed' >> beam.WindowInto(FixedWindows(300))  # 5-minute windows
>          | 'MapToSingleKey' >> beam.Map(lambda x: (1, x))  # Map to a single key
>          | 'StatefulWriteToFile' >> beam.ParDo(StatefulWriteToFileDoFn())
>          )
>
> So the state-full dofn will keep its state per key per window. So each
> window should create its own output file. But note that this requires you to
> put all the 5 min data in same worker as only a single worker can be
> responsible for creating the file and hence the shuffling is required.
>
> There is a workaround If the traffic might be too big, but it would mean
> to generate more files per 5 min window (one file per worker).
>
> The trick is to assign key that uniquely represents the worker, not the
> data. So every worker that maps the key should have his unique value put in
> this key.
> See this example:
> class _Key(DoFn):
>     def setup(self):
>         self.shard_prefix = str(uuid4())
>     def process(
>         self,
>         x: input_type,
>     ) -> Iterable[tuple[str, input_type]]:
>         yield (
>             self.shard_prefix + str(threading.get_ident()),  # each worker may create its batch
>             x,
>         )
> And then you can use it like
> | "KeyPerWorker" >> ParDo(_Key())
> instead of using constant key with the first approach. Also remember to
> make sure file names are unique if using this approach.
> Best
> Wisniowski Piotr
>
> On 25.08.2024 20:30, vamsikrishna korada wrote:
>
> Hi everyone,
>
> I'm new to Apache Beam and have a question regarding its usage.
>
> I have a scenario where I need to read a stream of elements from a
> PCollection and write them to a new file every 5 minutes.
>
> Initially, I considered using Timers and state stores, but I discovered
> that Timers are only applicable to KV pairs. If I convert my PCollection
> into a key-value pair with a dummy key and then use timers, I encountered
> several issues:
>
>1. It introduces an additional shuffle.
>2. With all elements sharing the same key, they would be processed by
>a single task in the Flink on Beam application. I prefer not to manually
>define the number of keys based on load because I plan to run multiple
>pipelines, each with varying loads.
>
> One alternative I considered is using a custom executor thread within my
> Writer DoFn to flush the records every 5 minutes. However, this approach
> would require me to use a lock to make sure only one of the process element
> and the flush blocks are running at a time.
>
> Is there a more effective way to accomplish this?
>
>
>
> Thanks,
>
> Vamsi
>
>
>
>


Re: [Question] Write a PCollection of elements periodically to a file in Flink on Beam application

2024-08-27 Thread Wiśniowski Piotr

Hi,

If you require only a single file per 5 min then what you are missing is 
that you need to window data into fixed windows, so that state-full DoFn 
could store the elements per key per window.


Something like (did not test this code just pseudo code):

# imports added so the pseudo code at least resolves (still untested)
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import userstate
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec
from apache_beam.transforms.window import FixedWindows


class StatefulWriteToFileDoFn(beam.DoFn):
    BAG_STATE = BagStateSpec('bag_state', VarIntCoder())
    TIMER = TimerSpec('timer', TimeDomain.WATERMARK)

    def process(self, element, timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam,
                bag_state=beam.DoFn.StateParam(BAG_STATE),
                timer=beam.DoFn.TimerParam(TIMER)):
        bag_state.add(element)
        timer.set(window.end)

    @userstate.on_timer(TIMER)  # the callback has to be bound to the timer spec
    def on_timer(self, window=beam.DoFn.WindowParam,
                 bag_state=beam.DoFn.StateParam(BAG_STATE)):
        # Here, you can generate a filename based on the window's end time, for example
        filename = f'output_{window.end.to_utc_datetime().strftime("%Y%m%d%H%M%S")}.txt'
        with open(filename, 'w') as f:
            for element in bag_state.read():
                f.write(f'{element}\n')


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'ReadStream' >> beam.io.ReadFromPubSub(topic='projects/.../topics/...')
         | 'WindowIntoFixed' >> beam.WindowInto(FixedWindows(300))  # 5-minute windows
         | 'MapToSingleKey' >> beam.Map(lambda x: (1, x))  # Map to a single key
         | 'StatefulWriteToFile' >> beam.ParDo(StatefulWriteToFileDoFn())
         )

So the stateful DoFn will keep its state per key per window, so each 
window should create its own output file. But note that this requires you 
to put all the 5 min data on the same worker, as only a single worker can be 
responsible for creating the file, and hence the shuffling is required.


There is a workaround If the traffic might be too big, but it would mean 
to generate more files per 5 min window (one file per worker).


The trick is to assign key that uniquely represents the worker, not the 
data. So every worker that maps the key should have his unique value put 
in this key.

See this example:

class _Key(DoFn):
    def setup(self):
        self.shard_prefix = str(uuid4())

    def process(
        self,
        x: input_type,
    ) -> Iterable[tuple[str, input_type]]:
        yield (
            self.shard_prefix + str(threading.get_ident()),  # each worker may create its batch
            x,
        )
And then you can use it like
| "KeyPerWorker" >> ParDo(_Key())
instead of using constant key with the first approach. Also remember to 
make sure file names are unique if using this approach.

Best
Wisniowski Piotr

On 25.08.2024 20:30, vamsikrishna korada wrote:


Hi everyone,

I'm new to Apache Beam and have a question regarding its usage.

I have a scenario where I need to read a stream of elements from a 
PCollection and write them to a new file every 5 minutes.


Initially, I considered using Timers and state stores, but I 
discovered that Timers are only applicable to KV pairs. If I convert 
my PCollection into a key-value pair with a dummy key and then use 
timers, I encountered several issues:


 1. It introduces an additional shuffle.
 2. With all elements sharing the same key, they would be processed by
a single task in the Flink on Beam application. I prefer not to
manually define the number of keys based on load because I plan to
run multiple pipelines, each with varying loads.

One alternative I considered is using a custom executor thread within 
my Writer DoFn to flush the records every 5 minutes. However, this 
approach would require me to use a lock to make sure only one of the 
process element and the flush blocks are running at a time.


Is there a more effective way to accomplish this?



Thanks,

Vamsi




Re: async/await logic in a Beam DoFn task

2024-08-26 Thread Sofia’s World
Pretty sure it will as I am calling very similar code..let me try that out
and report back
Thanks!!

On Mon, 26 Aug 2024, 08:44 Jaehyeon Kim,  wrote:

> Something like this work for you? I just played with a 3rd party package
> so you need to install it (pip install yahoo-finance-async)
>
> import asyncio
>
> import apache_beam as beam
> from yahoo_finance_async import OHLC
>
>
> class AsyncProcess(beam.DoFn):
> async def fetch_data(self, element: str):
> result = await OHLC.fetch(symbol=element)
> return [d["open"] for d in result["candles"]]
>
> def process(self, element: str):
> return asyncio.run(self.fetch_data(element))
>
>
> def run():
> with beam.Pipeline() as p:
> (p | beam.Create(["AAPL"]) | beam.ParDo(AsyncProcess()) | beam.Map
> (print))
>
>
> if __name__ == "__main__":
> run()
>
> On Mon, 26 Aug 2024 at 16:48, Sofia’s World  wrote:
>
>> Hello
>>  thanks all
>>  have tried this simple test case - before checking URL above - but did
>> not work out
>>
>> async def process_data(element):
>> # Asynchronous processing logic
>> await asyncio.sleep(5)  # Simulate an asynchronous operation
>> return element
>>
>>
>> class AsyncProcess(beam.DoFn):
>>
>> def __init__(self):
>> self.loop = asyncio.new_event_loop()
>> def process(self, element):
>> async def process_element_async():
>> return await process_data(element)
>>
>> return self.loop.run_until_complete(process_element_async())
>>
>>
>> when i put this in a pipeline  (debugSink is simply  beam.Map(
>> logging.info)
>>
>> with TestPipeline(options=PipelineOptions()) as p:
>> input = (p | 'Start' >> beam.Create(['AAPL'])
>>  | 'Run Loader' >> beam.ParDo(AsyncProcess())
>>  | self.debugSink
>>  )
>>
>> i end up getting this.. which somehow i was expecting
>>
>> self = > type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 51510),
>> raddr=('127.0.0.1', 51511)>
>>
>> def __getstate__(self):
>> >   raise TypeError(f"cannot pickle {self.__class__.__name__!r}
>> object")
>> E   TypeError: cannot pickle 'socket' object
>>
>>
>> The usecase i have is that i have to call an API which under the hood
>> uses async api
>>
>> data = await fetcher.fetch_data(params, credentials)
>>
>>
>>
>> so on my side, not sure which options i have  (bear in mind i am also an
>> asyncio  noob)
>>
>> Thanks and regards
>> Marco
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Aug 25, 2024 at 11:24 PM Jaehyeon Kim  wrote:
>>
>>> I guess there are multiple options.
>>>
>>> The easiest one would be converting the async method to a sync method or
>>> creating a wrapper method for doing so.
>>>
>>> Also, the following stackoverflow post introduces another two options
>>> you may find useful.
>>>
>>>- asynchronous API calls in apache beam -
>>>
>>> https://stackoverflow.com/questions/72842846/asynchronous-api-calls-in-apache-beam
>>>
>>>
>>>
>>>
>>> On Mon, 26 Aug 2024 at 01:40, Sofia’s World  wrote:
>>>
 Not sure...I think I'll do a sample  and post it on the list ..
 Thanks
  Marco

 On Sun, 25 Aug 2024, 14:46 XQ Hu via user, 
 wrote:

>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Wait.html
>
> Is this something you are looking for?
>
> On Sun, Aug 25, 2024 at 6:21 AM Sofia’s World 
> wrote:
>
>> HI all
>>   i want to write a pipeline where, as part of one of the steps, i
>> will need to use
>> an await  call  ,such as this one
>>
>> await fetcher.fetch_data(params, self.credentials)
>>
>> is Beam equipped for that?
>>
>> Does anyone have a sample to pass me for reference?
>>
>> Kind regards
>>  Marco
>>
>>
>>


Re: async/await logic in a Beam DoFn task

2024-08-26 Thread Jaehyeon Kim
Something like this work for you? I just played with a 3rd party package so
you need to install it (pip install yahoo-finance-async)

import asyncio

import apache_beam as beam
from yahoo_finance_async import OHLC


class AsyncProcess(beam.DoFn):
    async def fetch_data(self, element: str):
        result = await OHLC.fetch(symbol=element)
        return [d["open"] for d in result["candles"]]

    def process(self, element: str):
        return asyncio.run(self.fetch_data(element))


def run():
    with beam.Pipeline() as p:
        (p | beam.Create(["AAPL"]) | beam.ParDo(AsyncProcess()) | beam.Map(print))


if __name__ == "__main__":
    run()
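
As a side note on the pickling error quoted below: a common workaround
(untested sketch) is to create the event loop lazily in setup() rather than
in __init__, so the DoFn instance that gets pickled at submission time never
holds a loop or socket:

import asyncio

import apache_beam as beam


class AsyncProcessWithLoop(beam.DoFn):
    def setup(self):
        # Runs on the worker after deserialization, so the loop (and any
        # sockets opened by the async client) is never pickled.
        self.loop = asyncio.new_event_loop()

    def process(self, element):
        async def fetch(item):
            await asyncio.sleep(0.1)  # stand-in for `await fetcher.fetch_data(...)`
            return item

        yield self.loop.run_until_complete(fetch(element))

    def teardown(self):
        self.loop.close()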

On Mon, 26 Aug 2024 at 16:48, Sofia’s World  wrote:

> Hello
>  thanks all
>  have tried this simple test case - before checking URL above - but did
> not work out
>
> async def process_data(element):
> # Asynchronous processing logic
> await asyncio.sleep(5)  # Simulate an asynchronous operation
> return element
>
>
> class AsyncProcess(beam.DoFn):
>
> def __init__(self):
> self.loop = asyncio.new_event_loop()
> def process(self, element):
> async def process_element_async():
> return await process_data(element)
>
> return self.loop.run_until_complete(process_element_async())
>
>
> when i put this in a pipeline  (debugSink is simply  beam.Map(logging.info
> )
>
> with TestPipeline(options=PipelineOptions()) as p:
> input = (p | 'Start' >> beam.Create(['AAPL'])
>  | 'Run Loader' >> beam.ParDo(AsyncProcess())
>  | self.debugSink
>  )
>
> i end up getting this.. which somehow i was expecting
>
> self =  type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 51510),
> raddr=('127.0.0.1', 51511)>
>
> def __getstate__(self):
> >   raise TypeError(f"cannot pickle {self.__class__.__name__!r}
> object")
> E   TypeError: cannot pickle 'socket' object
>
>
> The usecase i have is that i have to call an API which under the hood uses
> async api
>
> data = await fetcher.fetch_data(params, credentials)
>
>
>
> so on my side, not sure which options i have  (bear in mind i am also an
> asyncio  noob)
>
> Thanks and regards
> Marco
>
>
>
>
>
>
>
>
> On Sun, Aug 25, 2024 at 11:24 PM Jaehyeon Kim  wrote:
>
>> I guess there are multiple options.
>>
>> The easiest one would be converting the async method to a sync method or
>> creating a wrapper method for doing so.
>>
>> Also, the following stackoverflow post introduces another two options you
>> may find useful.
>>
>>- asynchronous API calls in apache beam -
>>
>> https://stackoverflow.com/questions/72842846/asynchronous-api-calls-in-apache-beam
>>
>>
>>
>>
>> On Mon, 26 Aug 2024 at 01:40, Sofia’s World  wrote:
>>
>>> Not sure...I think I'll do a sample  and post it on the list ..
>>> Thanks
>>>  Marco
>>>
>>> On Sun, 25 Aug 2024, 14:46 XQ Hu via user,  wrote:
>>>

 https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Wait.html

 Is this something you are looking for?

 On Sun, Aug 25, 2024 at 6:21 AM Sofia’s World 
 wrote:

> HI all
>   i want to write a pipeline where, as part of one of the steps, i
> will need to use
> an await  call  ,such as this one
>
> await fetcher.fetch_data(params, self.credentials)
>
> is Beam equipped for that?
>
> Does anyone have a sample to pass me for reference?
>
> Kind regards
>  Marco
>
>
>


Re: async/await logic in a Beam DoFn task

2024-08-25 Thread Sofia’s World
Hello
 thanks all
 have tried this simple test case - before checking URL above - but did not
work out

async def process_data(element):
    # Asynchronous processing logic
    await asyncio.sleep(5)  # Simulate an asynchronous operation
    return element


class AsyncProcess(beam.DoFn):

    def __init__(self):
        self.loop = asyncio.new_event_loop()

    def process(self, element):
        async def process_element_async():
            return await process_data(element)

        return self.loop.run_until_complete(process_element_async())


when i put this in a pipeline  (debugSink is simply  beam.Map(logging.info)

with TestPipeline(options=PipelineOptions()) as p:
input = (p | 'Start' >> beam.Create(['AAPL'])
 | 'Run Loader' >> beam.ParDo(AsyncProcess())
 | self.debugSink
 )

i end up getting this.. which somehow i was expecting

self = 

def __getstate__(self):
>   raise TypeError(f"cannot pickle {self.__class__.__name__!r} object")
E   TypeError: cannot pickle 'socket' object


The usecase i have is that i have to call an API which under the hood uses
async api

data = await fetcher.fetch_data(params, credentials)



so on my side, not sure which options i have  (bear in mind i am also an
asyncio  noob)

Thanks and regards
Marco








On Sun, Aug 25, 2024 at 11:24 PM Jaehyeon Kim  wrote:

> I guess there are multiple options.
>
> The easiest one would be converting the async method to a sync method or
> creating a wrapper method for doing so.
>
> Also, the following stackoverflow post introduces another two options you
> may find useful.
>
>- asynchronous API calls in apache beam -
>
> https://stackoverflow.com/questions/72842846/asynchronous-api-calls-in-apache-beam
>
>
>
>
> On Mon, 26 Aug 2024 at 01:40, Sofia’s World  wrote:
>
>> Not sure...I think I'll do a sample  and post it on the list ..
>> Thanks
>>  Marco
>>
>> On Sun, 25 Aug 2024, 14:46 XQ Hu via user,  wrote:
>>
>>>
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Wait.html
>>>
>>> Is this something you are looking for?
>>>
>>> On Sun, Aug 25, 2024 at 6:21 AM Sofia’s World 
>>> wrote:
>>>
 HI all
   i want to write a pipeline where, as part of one of the steps, i will
 need to use
 an await  call  ,such as this one

 await fetcher.fetch_data(params, self.credentials)

 is Beam equipped for that?

 Does anyone have a sample to pass me for reference?

 Kind regards
  Marco





Re: async/await logic in a Beam DoFn task

2024-08-25 Thread Jaehyeon Kim
I guess there are multiple options.

The easiest one would be converting the async method to a sync method or
creating a wrapper method for doing so.

Also, the following stackoverflow post introduces another two options you
may find useful.

   - asynchronous API calls in apache beam -
   
https://stackoverflow.com/questions/72842846/asynchronous-api-calls-in-apache-beam




On Mon, 26 Aug 2024 at 01:40, Sofia’s World  wrote:

> Not sure...I think I'll do a sample  and post it on the list ..
> Thanks
>  Marco
>
> On Sun, 25 Aug 2024, 14:46 XQ Hu via user,  wrote:
>
>>
>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Wait.html
>>
>> Is this something you are looking for?
>>
>> On Sun, Aug 25, 2024 at 6:21 AM Sofia’s World 
>> wrote:
>>
>>> HI all
>>>   i want to write a pipeline where, as part of one of the steps, i will
>>> need to use
>>> an await  call  ,such as this one
>>>
>>> await fetcher.fetch_data(params, self.credentials)
>>>
>>> is Beam equipped for that?
>>>
>>> Does anyone have a sample to pass me for reference?
>>>
>>> Kind regards
>>>  Marco
>>>
>>>
>>>


Re: async/await logic in a Beam DoFn task

2024-08-25 Thread Sofia’s World
Not sure...I think I'll do a sample  and post it on the list ..
Thanks
 Marco

On Sun, 25 Aug 2024, 14:46 XQ Hu via user,  wrote:

>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Wait.html
>
> Is this something you are looking for?
>
> On Sun, Aug 25, 2024 at 6:21 AM Sofia’s World  wrote:
>
>> HI all
>>   i want to write a pipeline where, as part of one of the steps, i will
>> need to use
>> an await  call  ,such as this one
>>
>> await fetcher.fetch_data(params, self.credentials)
>>
>> is Beam equipped for that?
>>
>> Does anyone have a sample to pass me for reference?
>>
>> Kind regards
>>  Marco
>>
>>
>>


Re: async/await logic in a Beam DoFn task

2024-08-25 Thread XQ Hu via user
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Wait.html

Is this something you are looking for?

On Sun, Aug 25, 2024 at 6:21 AM Sofia’s World  wrote:

> HI all
>   i want to write a pipeline where, as part of one of the steps, i will
> need to use
> an await  call  ,such as this one
>
> await fetcher.fetch_data(params, self.credentials)
>
> is Beam equipped for that?
>
> Does anyone have a sample to pass me for reference?
>
> Kind regards
>  Marco
>
>
>


Re: Question on slowly updating global window side inputs

2024-08-14 Thread Ahmet Altay via user
Thank you!

On Wed, Aug 14, 2024 at 4:55 PM Jaehyeon Kim  wrote:

> Hello,
>
> I created a PR that adds examples in the common pipeline patterns section
> for using a shared object as a cache -
> https://github.com/apache/beam/pull/32187.
>
> On Wed, 14 Aug 2024 at 07:59, Jaehyeon Kim  wrote:
>
>> Thank you for the suggestion. Let me think about how to contribute and
>> take an action.
>>
>> On Tue, 13 Aug 2024, 8:50 am Ahmet Altay via user, 
>> wrote:
>>
>>> Thank you for the follow up.
>>>
>>> If you think that presentation is useful, and this is not properly
>>> captured in docs, would you be kind enough to help us improve our docs? :)
>>>
>>> It could be a link to that deck, and github issue, or new content in
>>> docs based on that presentation.
>>>
>>> On Sat, Aug 3, 2024 at 6:37 PM Jaehyeon Kim  wrote:
>>>
 Hello,

 What I look into can actually be achieved by implementing one of the
 caching strategies in a talk at Beam Summit 2022.

- Strategies for caching data in Dataflow using Beam SDK

 

 Among the 4 options, I'd try a side input and the shared module
 (with/without side input) first.

 Cheers,
 Jaehyeon


 On Thu, 1 Aug 2024 at 13:30, Jaehyeon Kim  wrote:

> Thank you for letting me know. It is also available in the Python SDK
> -
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L1656
>
> However, it doesn't seem to meet the requirement that the side input
> values can change over time because the mentioned transform just seems to
> wait until the previous one gets completed. What I look into is, let's 
> say,
> a customer attribute changes then order records should be enriched with 
> the
> updated attribute.
>
> On Thu, 1 Aug 2024 at 13:14, LDesire  wrote:
>
>> Hello. I found a similar example code.
>> You can use `Wait` PTransform.
>> Wait (Apache Beam 2.13.0)
>> 
>> beam.apache.org
>> 
>> [image: favicon.ico]
>> 
>> 
>>
>> Hope this helps.
>>
>> [image: stateful-beam-realtime.png]
>>
>> stateful-beam-realtime/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java
>> at 2cc16a9cf8460c5b0e4d749e81654273c14ffb00 · 
>> Jimmyst/stateful-beam-realtime
>> 
>> github.com
>> 
>>
>> 
>>
>>
>> 2024. 8. 1. 오전 11:52, Jaehyeon Kim  작성:
>>
>> Hello,
>>
>> I'm looking into side input patterns especially slowly updating
>> global window side inputs -
>> https://beam.apache.org/documentation/patterns/side-inputs/
>>
>> It'd be useful if we need to enrich eg) order records with customer
>> details where customer details would be taken as a side input.
>>
>> Let's say we have two Kafka topics, one for client records and the
>> other for order records. For the enrichment to work properly, consumption
>> of order records should wait until all customer records are read.
>>
>> Can you please inform me if it is achievable?
>>
>> Cheers,
>> Jaehyeon
>>
>>
>>




Re: Question on slowly updating global window side inputs

2024-08-14 Thread Jaehyeon Kim
Hello,

I created a PR that adds examples in the common pipeline patterns section
for using a shared object as a cache -
https://github.com/apache/beam/pull/32187.

On Wed, 14 Aug 2024 at 07:59, Jaehyeon Kim  wrote:

> Thank you for the suggestion. Let me think about how to contribute and
> take an action.
>
> On Tue, 13 Aug 2024, 8:50 am Ahmet Altay via user, 
> wrote:
>
>> Thank you for the follow up.
>>
>> If you think that presentation is useful, and this is not properly
>> captured in docs, would you be kind enough to help us improve our docs? :)
>>
>> It could be a link to that deck, and github issue, or new content in docs
>> based on that presentation.
>>
>> On Sat, Aug 3, 2024 at 6:37 PM Jaehyeon Kim  wrote:
>>
>>> Hello,
>>>
>>> What I look into can actually be achieved by implementing one of the
>>> caching strategies in a talk at Beam Summit 2022.
>>>
>>>- Strategies for caching data in Dataflow using Beam SDK
>>>
>>> 
>>>
>>> Among the 4 options, I'd try a side input and the shared module
>>> (with/without side input) first.
>>>
>>> Cheers,
>>> Jaehyeon
>>>
>>>
>>> On Thu, 1 Aug 2024 at 13:30, Jaehyeon Kim  wrote:
>>>
 Thank you for letting me know. It is also available in the Python SDK -
 https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L1656

 However, it doesn't seem to meet the requirement that the side input
 values can change over time because the mentioned transform just seems to
 wait until the previous one gets completed. What I look into is, let's say,
 a customer attribute changes then order records should be enriched with the
 updated attribute.

 On Thu, 1 Aug 2024 at 13:14, LDesire  wrote:

> Hello. I found a similar example code.
> You can use `Wait` PTransform.
> Wait (Apache Beam 2.13.0)
> 
> beam.apache.org
> 
> [image: favicon.ico]
> 
> 
>
> Hope this helps.
>
> [image: stateful-beam-realtime.png]
>
> stateful-beam-realtime/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java
> at 2cc16a9cf8460c5b0e4d749e81654273c14ffb00 · 
> Jimmyst/stateful-beam-realtime
> 
> github.com
> 
>
> 
>
>
> 2024. 8. 1. 오전 11:52, Jaehyeon Kim  작성:
>
> Hello,
>
> I'm looking into side input patterns especially slowly updating global
> window side inputs -
> https://beam.apache.org/documentation/patterns/side-inputs/
>
> It'd be useful if we need to enrich eg) order records with customer
> details where customer details would be taken as a side input.
>
> Let's say we have two Kafka topics, one for client records and the
> other for order records. For the enrichment to work properly, consumption
> of order records should wait until all customer records are read.
>
> Can you please inform me if it is achievable?
>
> Cheers,
> Jaehyeon
>
>
>




Re: Question on slowly updating global window side inputs

2024-08-13 Thread Jaehyeon Kim
Thank you for the suggestion. Let me think about how to contribute and take
an action.

On Tue, 13 Aug 2024, 8:50 am Ahmet Altay via user, 
wrote:

> Thank you for the follow up.
>
> If you think that presentation is useful, and this is not properly
> captured in docs, would you be kind enough to help us improve our docs? :)
>
> It could be a link to that deck, and github issue, or new content in docs
> based on that presentation.
>
> On Sat, Aug 3, 2024 at 6:37 PM Jaehyeon Kim  wrote:
>
>> Hello,
>>
>> What I look into can actually be achieved by implementing one of the
>> caching strategies in a talk at Beam Summit 2022.
>>
>>- Strategies for caching data in Dataflow using Beam SDK
>>
>> 
>>
>> Among the 4 options, I'd try a side input and the shared module
>> (with/without side input) first.
>>
>> Cheers,
>> Jaehyeon
>>
>>
>> On Thu, 1 Aug 2024 at 13:30, Jaehyeon Kim  wrote:
>>
>>> Thank you for letting me know. It is also available in the Python SDK -
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L1656
>>>
>>> However, it doesn't seem to meet the requirement that the side input
>>> values can change over time because the mentioned transform just seems to
>>> wait until the previous one gets completed. What I look into is, let's say,
>>> a customer attribute changes then order records should be enriched with the
>>> updated attribute.
>>>
>>> On Thu, 1 Aug 2024 at 13:14, LDesire  wrote:
>>>
 Hello. I found a similar example code.
 You can use `Wait` PTransform.
 Wait (Apache Beam 2.13.0)
 
 beam.apache.org
 
 [image: favicon.ico]
 
 

 Hope this helps.

 [image: stateful-beam-realtime.png]

 stateful-beam-realtime/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java
 at 2cc16a9cf8460c5b0e4d749e81654273c14ffb00 · 
 Jimmyst/stateful-beam-realtime
 
 github.com
 

 


 2024. 8. 1. 오전 11:52, Jaehyeon Kim  작성:

 Hello,

 I'm looking into side input patterns especially slowly updating global
 window side inputs -
 https://beam.apache.org/documentation/patterns/side-inputs/

 It'd be useful if we need to enrich eg) order records with customer
 details where customer details would be taken as a side input.

 Let's say we have two Kafka topics, one for client records and the
 other for order records. For the enrichment to work properly, consumption
 of order records should wait until all customer records are read.

 Can you please inform me if it is achievable?

 Cheers,
 Jaehyeon







Re: Question on slowly updating global window side inputs

2024-08-12 Thread Ahmet Altay via user
Thank you for the follow up.

If you think that presentation is useful, and this is not properly captured
in docs, would you be kind enough to help us improve our docs? :)

It could be a link to that deck, and github issue, or new content in docs
based on that presentation.

On Sat, Aug 3, 2024 at 6:37 PM Jaehyeon Kim  wrote:

> Hello,
>
> What I look into can actually be achieved by implementing one of the
> caching strategies in a talk at Beam Summit 2022.
>
>- Strategies for caching data in Dataflow using Beam SDK
>
> 
>
> Among the 4 options, I'd try a side input and the shared module
> (with/without side input) first.
>
> Cheers,
> Jaehyeon
>
>
> On Thu, 1 Aug 2024 at 13:30, Jaehyeon Kim  wrote:
>
>> Thank you for letting me know. It is also available in the Python SDK -
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L1656
>>
>> However, it doesn't seem to meet the requirement that the side input
>> values can change over time because the mentioned transform just seems to
>> wait until the previous one gets completed. What I look into is, let's say,
>> a customer attribute changes then order records should be enriched with the
>> updated attribute.
>>
>> On Thu, 1 Aug 2024 at 13:14, LDesire  wrote:
>>
>>> Hello. I found a similar example code.
>>> You can use `Wait` PTransform.
>>> Wait (Apache Beam 2.13.0)
>>> 
>>> beam.apache.org
>>> 
>>> [image: favicon.ico]
>>> 
>>> 
>>>
>>> Hope this helps.
>>>
>>> [image: stateful-beam-realtime.png]
>>>
>>> stateful-beam-realtime/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java
>>> at 2cc16a9cf8460c5b0e4d749e81654273c14ffb00 · Jimmyst/stateful-beam-realtime
>>> 
>>> github.com
>>> 
>>>
>>> 
>>>
>>>
>>> 2024. 8. 1. 오전 11:52, Jaehyeon Kim  작성:
>>>
>>> Hello,
>>>
>>> I'm looking into side input patterns especially slowly updating global
>>> window side inputs -
>>> https://beam.apache.org/documentation/patterns/side-inputs/
>>>
>>> It'd be useful if we need to enrich eg) order records with customer
>>> details where customer details would be taken as a side input.
>>>
>>> Let's say we have two Kafka topics, one for client records and the other
>>> for order records. For the enrichment to work properly, consumption of
>>> order records should wait until all customer records are read.
>>>
>>> Can you please inform me if it is achievable?
>>>
>>> Cheers,
>>> Jaehyeon
>>>
>>>
>>>




Re: Question on slowly updating global window side inputs

2024-08-03 Thread Jaehyeon Kim
Hello,

What I look into can actually be achieved by implementing one of the
caching strategies in a talk at Beam Summit 2022.

   - Strategies for caching data in Dataflow using Beam SDK
   


Among the 4 options, I'd try a side input and the shared module
(with/without side input) first.
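
For the side-input option, here is a rough, untested sketch based on the
documented 'slowly updating side input' pattern; the refresh interval, the
customer lookup and the order source are placeholders:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils.timestamp import MAX_TIMESTAMP, Timestamp

REFRESH_INTERVAL = 60  # seconds; placeholder


def load_customers(_):
    # Placeholder: re-read customer details from a table / compacted topic.
    return [("cust-1", {"tier": "gold"})]


def enrich(order, customers):
    key, payload = order
    return {**payload, "customer": customers.get(key)}


with beam.Pipeline() as p:
    customer_side = (
        p
        | "Tick" >> PeriodicImpulse(
            start_timestamp=Timestamp.now(),
            stop_timestamp=MAX_TIMESTAMP,
            fire_interval=REFRESH_INTERVAL,
            apply_windowing=True)
        | "Reload" >> beam.FlatMap(load_customers)
    )

    orders = (
        p
        | "Orders" >> beam.Create([("cust-1", {"order_id": 1})])  # stand-in for the Kafka order stream
        | "WindowOrders" >> beam.WindowInto(window.FixedWindows(REFRESH_INTERVAL))
    )

    enriched = orders | "Enrich" >> beam.Map(
        enrich, customers=beam.pvalue.AsDict(customer_side))

Each main-input window then sees whatever customer snapshot was loaded in
the matching side-input window, which is what lets the side input change
over time.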

Cheers,
Jaehyeon


On Thu, 1 Aug 2024 at 13:30, Jaehyeon Kim  wrote:

> Thank you for letting me know. It is also available in the Python SDK -
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L1656
>
> However, it doesn't seem to meet the requirement that the side input
> values can change over time because the mentioned transform just seems to
> wait until the previous one gets completed. What I look into is, let's say,
> a customer attribute changes then order records should be enriched with the
> updated attribute.
>
> On Thu, 1 Aug 2024 at 13:14, LDesire  wrote:
>
>> Hello. I found a similar example code.
>> You can use `Wait` PTransform.
>> Wait (Apache Beam 2.13.0)
>> 
>> beam.apache.org
>> 
>> [image: favicon.ico]
>> 
>> 
>>
>> Hope this helps.
>>
>> [image: stateful-beam-realtime.png]
>>
>> stateful-beam-realtime/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java
>> at 2cc16a9cf8460c5b0e4d749e81654273c14ffb00 · Jimmyst/stateful-beam-realtime
>> 
>> github.com
>> 
>>
>> 
>>
>>
>> 2024. 8. 1. 오전 11:52, Jaehyeon Kim  작성:
>>
>> Hello,
>>
>> I'm looking into side input patterns especially slowly updating global
>> window side inputs -
>> https://beam.apache.org/documentation/patterns/side-inputs/
>>
>> It'd be useful if we need to enrich eg) order records with customer
>> details where customer details would be taken as a side input.
>>
>> Let's say we have two Kafka topics, one for client records and the other
>> for order records. For the enrichment to work properly, consumption of
>> order records should wait until all customer records are read.
>>
>> Can you please inform me if it is achievable?
>>
>> Cheers,
>> Jaehyeon
>>
>>
>>




Re: Question on slowly updating global window side inputs

2024-07-31 Thread Jaehyeon Kim
Thank you for letting me know. It is also available in the Python SDK -
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L1656

However, it doesn't seem to meet the requirement that the side input values
can change over time because the mentioned transform just seems to wait
until the previous one gets completed. What I look into is, let's say, a
customer attribute changes then order records should be enriched with the
updated attribute.

On Thu, 1 Aug 2024 at 13:14, LDesire  wrote:

> Hello. I found a similar example code.
> You can use `Wait` PTransform.
> Wait (Apache Beam 2.13.0)
> 
> beam.apache.org
> 
> [image: favicon.ico]
> 
> 
>
> Hope this helps.
>
> [image: stateful-beam-realtime.png]
>
> stateful-beam-realtime/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java
> at 2cc16a9cf8460c5b0e4d749e81654273c14ffb00 · Jimmyst/stateful-beam-realtime
> 
> github.com
> 
>
> 
>
>
> 2024. 8. 1. 오전 11:52, Jaehyeon Kim  작성:
>
> Hello,
>
> I'm looking into side input patterns especially slowly updating global
> window side inputs -
> https://beam.apache.org/documentation/patterns/side-inputs/
>
> It'd be useful if we need to enrich eg) order records with customer
> details where customer details would be taken as a side input.
>
> Let's say we have two Kafka topics, one for client records and the other
> for order records. For the enrichment to work properly, consumption of
> order records should wait until all customer records are read.
>
> Can you please inform me if it is achievable?
>
> Cheers,
> Jaehyeon
>
>
>




Re: Question on slowly updating global window side inputs

2024-07-31 Thread LDesire
Hello. I found a similar example code.
You can use `Wait` PTransform.
https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/Wait.html

Hope this helps.

https://github.com/Jimmyst/stateful-beam-realtime/blob/2cc16a9cf8460c5b0e4d749e81654273c14ffb00/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java#L92
stateful-beam-realtime/pipeline/src/main/java/org/stjimmy/beam/LtvPipelineSqlLookup.java
 at 2cc16a9cf8460c5b0e4d749e81654273c14ffb00 · Jimmyst/stateful-beam-realtime
github.com


> 2024. 8. 1. 오전 11:52, Jaehyeon Kim  작성:
> 
> Hello,
> 
> I'm looking into side input patterns especially slowly updating global window 
> side inputs - https://beam.apache.org/documentation/patterns/side-inputs/
> 
> It'd be useful if we need to enrich eg) order records with customer details 
> where customer details would be taken as a side input.
> 
> Let's say we have two Kafka topics, one for client records and the other for 
> order records. For the enrichment to work properly, consumption of order 
> records should wait until all customer records are read.
> 
> Can you please inform me if it is achievable?
> 
> Cheers,
> Jaehyeon



Re: Apache beam github repo collaborator

2024-07-13 Thread XQ Hu via user
Welcome to Beam! You can start contributing now.

Some useful docs:

   - https://github.com/apache/beam/blob/master/CONTRIBUTING.md
   - https://github.com/apache/beam/tree/master/contributor-docs
   - https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides

You can start with some good first issues:
https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22+


On Sat, Jul 13, 2024 at 5:47 PM Prerit Chandok 
wrote:

> Hi Team,
>
> I would like to contribute to the Apache beam repository for which I would
> like to request for the Contributor Access to the repo. Thanks.
>
> Best Regards,
> Prerit Chandok
>
>


Re: [Question] Issue with MongoDB Read in Apache Beam - InvalidBSON Error

2024-07-11 Thread XQ Hu via user
You are welcome to create a PR to fix this issue if you need to change the
connector source code.
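
In the meantime, one possible workaround (untested sketch) is to read the
affected collection in a plain DoFn with pymongo, where the
datetime-conversion option mentioned in the error message can be passed to
the client; the connection details below are placeholders:

import apache_beam as beam
from pymongo import MongoClient


class ReadWithAutoDatetime(beam.DoFn):
    def __init__(self, uri, db, collection):
        self.uri = uri
        self.db = db
        self.collection = collection

    def setup(self):
        # Requires pymongo >= 4.3; this is the option the error message suggests.
        self.client = MongoClient(self.uri, datetime_conversion="DATETIME_AUTO")

    def process(self, query_filter):
        for doc in self.client[self.db][self.collection].find(query_filter):
            yield doc

    def teardown(self):
        self.client.close()


with beam.Pipeline() as p:
    docs = (
        p
        | beam.Create([{}])  # a single trivial filter; split it up for parallelism
        | beam.ParDo(ReadWithAutoDatetime(
            "mongodb://localhost:27017", "mydb", "mycoll"))  # placeholders
    )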

On Sun, Jul 7, 2024 at 5:39 AM Marcin Stańczak 
wrote:

> Hello Apache Beam Community,
>
> I'm Marcin and I am currently working on a project using Apache Beam
> 2.57.0. I have encountered an issue when reading data from MongoDB
> with the "mongodbio" connector. I am unable to reach the
> transformation step due to an InvalidBSON error related to
> out-of-range dates.
>
> Error Message:
>
> bson.errors.InvalidBSON: year 55054 is out of range (Consider Using
> CodecOptions(datetime_conversion=DATETIME_AUTO) or
> MongoClient(datetime_conversion='DATETIME_AUTO')). See:
>
> https://pymongo.readthedocs.io/en/stable/examples/datetimes.html#handling-out-of-range-datetimes
>
> Here are the details of my setup:
>
> Apache Beam version: 2.57.0
> Python version: 3.10
>
> In my current MongoDB collection, it is possible to encounter dates
> that are out of the standard range, such as year 0 or years greater
> than , which causes this issue.
>
> I have handled this issue in standalone Python scripts using
> CodecOptions and DatetimeConversion. However, I am facing difficulties
> integrating this logic within an Apache Beam pipeline and I don't
> think it's possible to handle without changing the source code of this
> connector. I would appreciate any guidance or suggestions on how to
> resolve this issue within the Beam framework.
>
> Thank you for your assistance.
>
> Best regards,
> Marcin
>


RE: Re: Question: Pipelines Stuck with Java 21 and BigQuery Storage Write API

2024-07-08 Thread Kensuke Tachibana
Hi,

This is Kensuke. I’m responding on behalf of Kazuha (he is my colleague).

> Which runner are you using?

We are using the DirectRunner. As you suggested, we specified
"--add-opens=java.base/java.lang=ALL-UNNAMED" in the JVM invocation command
line, and it worked in the Java 21 environment. Specifically, after setting
"export MAVEN_OPTS='--add-opens=java.base/java.lang=ALL-UNNAMED'", we ran
"mvn compile exec:java" and the error was resolved.

> The same enforcement was introduced in both Java17 and 21, and it is
> strange that Java17 worked without the option but Java21 didn't.

I agree that it is strange.

> Are you testing on the same beam version and other configurations?

We have not made any changes other than what was mentioned in previous
emails.

> Try the latest beam version 2.56.0 and this option may not be needed.

We tried using Beam version 2.56.0, but now we are encountering a different
error and it doesn’t work. This error occurs with Beam 2.56.0 in both Java
17 and Java 21 environments. I have investigated the issue myself, but I
couldn’t determine the cause.

Do you have any insights based on this information?

On 2024/06/07 14:18:07 Yi Hu via user wrote:
> Hi,
>
> Which runner are you using? If you are running on Dataflow runner, then
> refer to this [1] and add
> "--jdkAddOpenModules=java.base/java.lang=ALL-UNNAMED" to pipeline option.
> If using direct runner, then add
> "--add-opens=java.base/java.lang=ALL-UNNAMED" to JVM invocation command
> line.
>
> The same enforcement was introduced in both Java17 and 21, and it is
> strange that Java17 worked without the option but Java21 didn't. Are you
> testing on the same beam version and other configurations? Also, more
> recent beam versions eliminated most usage of
> "ClassLoadingStrategy.Default.INJECTION"
> that cause this pipeline option being required, e.g. [2]. Try the latest
> beam version 2.56.0 and this option may not be needed.
>
> [1]
>
https://beam.apache.org/releases/javadoc/current/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html#getJdkAddOpenModules--
>
> [2] https://github.com/apache/beam/pull/30367
>
>
>
> On Mon, Jun 3, 2024 at 7:14 PM XQ Hu  wrote:
>
> > Probably related to the strict encapsulation that is enforced with Java
> > 21.
> > Use `--add-opens=java.base/java.lang=ALL-UNNAMED` as the JVM flag could
be
> > a temporary workaround.
> >
> > On Mon, Jun 3, 2024 at 3:04 AM 田中万葉  wrote:
> >
> >> Hi all,
> >>
> >> I encountered an UnsupportedOperationException when using Java 21 and
the
> >> BigQuery Storage Write API in a Beam pipeline by using
> >> ".withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));"
> >>
> >> Having read issue #28120[1] and understanding that Beam version 2.52.0
or
> >> later supports Java 21 as a runtime, I wonder why such an error
happens.
> >>
> >> I found there are two workarounds, but the Storage Write API is a more
> >> preferable way to insert data into BigQuery, so I'd like to find a
> >> solution.
> >>
> >> 1. One workaround is to switch from Java 21 to Java 17(openjdk version
> >> "17.0.10" 2024-01-16). By changing the  and
> >>  in the pom.xml file (i.e., without modifying
> >> App.java itself), the pipeline successfully writes data to my
destination
> >> table on BigQuery. It seems Java 17 and BigQuery Storage Write API
works
> >> fine.
> >> 2. The other workaround is to change insert method. I tried the
BigQuery
> >> legacy streaming API(
> >> https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery )
> >> instead of the Storage Write API. Even though I still used Java 21,
when I
> >> changed my code to
> >> .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));, I did not
> >> encounter the error.
> >>
> >> So I faced the error only when using Java 21 and BigQuery Storage Write
> >> API.
> >>
> >> I uploaded the code below to reproduce. Could you please inform me how
to
> >> handle this issue?
> >> https://github.com/cloud-ace/min-reproduce
> >>
> >> My Environment
> >> - OS
> >>   - Ubuntu 22.04
> >>   - Mac OS Sonoma(14.3.1)
> >> - beam 2.53.0, 2.54.0
> >> - openjdk version "21.0.2" 2024-01-16
> >> - maven 3.9.6
> >> - DirectRunner
> >>
> >> Thanks,
> >>
> >> Kazuha
> >>
> >> [1]: https://github.com/apache/beam/issues/28120
> >>
> >> Here is the detailed error message.
> >>
> >> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
> >> java.lang.UnsupportedOperationException: Cannot define class using
> >> reflection: Unable to make protected java.lang.Package
> >> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
> >> java.base does not "opens java.lang" to unnamed module @116d5dff
> >>
> >> Caused by: java.lang.UnsupportedOperationException: Cannot define class
> >> using reflection: Unable to make protected java.lang.Package
> >> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
> >> java.base does not "opens java.lang" to unnamed module @116d5dff
> >> at
> >>
net.bytebuddy.dynamic.loading.Cla

Re: beam using flink runner to achive data locality in a distributed setup?

2024-07-02 Thread Jan Lukavský
Unfortunately, no. At least not in the case of the FlinkRunner. As already 
mentioned, Beam does not currently collect information about the location of 
source splits, thus this information cannot be passed to Flink.


> If there is no locality-aware processing the whole thing falls on
> its face.


A 1 Gbps network (current networks should actually be at least 10 Gbps) 
is quite "close" to the throughput of a single spinning disk. On the other 
hand, if the target is "seconds", you might want to have a look at some 
SQL-based distributed analytical engines; Flink's startup time itself will 
likely add significant overhead on top of the processing time.


On 7/2/24 16:03, Balogh, György wrote:

Hi Jan,
I need to process hundreds of GBs of data within seconds. With local 
data processing I can properly size a hw infrastructure to meet this 
(a couple of years back I did this with Hadoop, and it worked perfectly). If 
there is no locality aware processing the whole thing falls on its 
face.


This comment suggests flink might do this under the hood?

https://stackoverflow.com/questions/38672091/flink-batch-data-local-planning-on-hdfs
Br,
Gyorgy


On Tue, Jul 2, 2024 at 3:08 PM Jan Lukavský  wrote:

Hi Gyorgy,

there is no concept of 'data locality' in Beam that would be
analogous to how MapReduce used to work. The fact that tasks
(compute) are co-located with storage on input is not transferred
to Beam Flink pipelines. The whole concept is kind of ill defined
in terms of Beam model, where tasks can be (at least in theory,
depending on a runner) moved between workers in a distributed
environment. The reason for this is that throughput (and cost) is
dominated mostly by the ability to (uniformly) scale, not the
costs associated with network transfers (this is actually most
visible in the streaming case, where the data is already 'in
motion'). The most common case in Beam is that compute is
completely separated from storage (possible even in the extreme
cases where streaming state is stored outside the compute of
streaming pipeline - but cached locally). The resulting
'stateless' nature of workers generally enables easier and more
flexible scaling.

Having said that, although Beam currently does not (AFAIK) try to
leverage local reads, it _could_ be possible by a reasonable
extension to how splittable DoFn [1] works so that it could make
use of data locality. It would be non-trivial, though, and would
definitely require support from the runner (Flink in this case).

My general suggestion would be to implement a prototype and
measure throughput (and the part of it possibly related to networking)
before attempting to dig deeper into how to implement this in Beam
Flink.

Best,

 Jan

[1] https://beam.apache.org/blog/splittable-do-fn/

On 7/2/24 10:46, Balogh, György wrote:

Hi Jan,
Separating live and historic storage makes sense. I need a
historic storage that can ensure data local processing using the
beam - flink stack.
Can I surely achieve this with HDFS? I can colocate hdfs nodes
with flink workers. What exactly enforces that flink nodes will
read local and not remote data?
Thank you,
Gyorgy

On Mon, Jul 1, 2024 at 3:42 PM Jan Lukavský  wrote:

Hi Gyorgy,

comments inline.

On 7/1/24 15:10, Balogh, György wrote:

Hi Jan,
Let me add a few more details to show the full picture. We
have live datastreams (video analysis metadata) and we would
like to run both live and historic pipelines on the metadata
(eg.: live alerts, historic video searches).

This should be fine due to Beam's unified model. You can
write a PTransform that handles PCollection<...> without the
need to worry if the PCollection was created from Kafka or
some bounded source.

We planned to use kafka to store the streaming data and
directly run both types of queries on top. You are
suggesting to consider having kafka with small retention to
serve the live queries and store the historic data
somewhere else which scales better for historic queries? We
need to have on prem options here. What options should we
consider that scales nicely (in terms of IO parallelization)
with beam? (eg. hdfs?)


Yes, I would not say necessarily "small" retention, but
probably "limited" retention. Running on premise you can
choose from HDFS or maybe S3 compatible minio or some other
distributed storage, depends on the scale and deployment
options (e.g. YARN or k8s).

I also happen to work on a system which targets exactly these
streaming-batch workloads (persisting upserts from stream to
batch for reprocessing), see [1]. Please feel free to contact
me directly if this sounds interesting.

Best,

 Jan

 

Re: beam using flink runner to achieve data locality in a distributed setup?

2024-07-02 Thread Balogh , György
Hi Jan,
I need to process hundreds of GBs of data within seconds. With local data
processing I can properly size a hw infrastructure to meet this (a couple
of years back I did this with Hadoop, and it worked perfectly). If there is no
locality aware processing the whole thing falls on its face.

This comment suggests flink might do this under the hood?

https://stackoverflow.com/questions/38672091/flink-batch-data-local-planning-on-hdfs
Br,
Gyorgy


On Tue, Jul 2, 2024 at 3:08 PM Jan Lukavský  wrote:

> Hi Gyorgy,
>
> there is no concept of 'data locality' in Beam that would be analogous to
> how MapReduce used to work. The fact that tasks (compute) are co-located
> with storage on input is not transferred to Beam Flink pipelines. The whole
> concept is kind of ill defined in terms of Beam model, where tasks can be
> (at least in theory, depending on a runner) moved between workers in a
> distributed environment. The reason for this is that throughput (and cost)
> is dominated mostly by the ability to (uniformly) scale, not the costs
> associated with network transfers (this is actually most visible in the
> streaming case, where the data is already 'in motion'). The most common
> case in Beam is that compute is completely separated from storage (possible
> even in the extreme cases where streaming state is stored outside the
> compute of streaming pipeline - but cached locally). The resulting
> 'stateless' nature of workers generally enables easier and more flexible
> scaling.
>
> Having said that, although Beam currently does not (AFAIK) try to leverage
> local reads, it _could_ be possible by a reasonable extension to how
> splittable DoFn [1] works so that it could make use of data locality. It
> would be non-trivial, though, and would definitely require support from the
> runner (Flink in this case).
>
> My general suggestion would be to implement a prototype and measure
> throughput (and the part of it possibly related to networking) before attempting
> to dig deeper into how to implement this in Beam Flink.
>
> Best,
>
>  Jan
>
> [1] https://beam.apache.org/blog/splittable-do-fn/
> On 7/2/24 10:46, Balogh, György wrote:
>
> Hi Jan,
> Separating live and historic storage makes sense. I need a historic
> storage that can ensure data local processing using the beam - flink stack.
> Can I surely achieve this with HDFS? I can colocate hdfs nodes with flink
> workers. What exactly enforces that flink nodes will read local and not
> remote data?
> Thank you,
> Gyorgy
>
> On Mon, Jul 1, 2024 at 3:42 PM Jan Lukavský  wrote:
>
>> Hi Gyorgy,
>>
>> comments inline.
>> On 7/1/24 15:10, Balogh, György wrote:
>>
>> Hi Jan,
>> Let me add a few more details to show the full picture. We have live
>> datastreams (video analysis metadata) and we would like to run both live
>> and historic pipelines on the metadata (eg.: live alerts, historic video
>> searches).
>>
>> This should be fine due to Beam's unified model. You can write a
>> PTransform that handles PCollection<...> without the need to worry if the
>> PCollection was created from Kafka or some bounded source.
>>
>> We planned to use kafka to store the streaming data and directly run both
>> types of queries on top. You are suggesting to consider having kafka with
> small retention to serve the live queries and store the historic data
>> somewhere else which scales better for historic queries? We need to have on
>> prem options here. What options should we consider that scales nicely (in
>> terms of IO parallelization) with beam? (eg. hdfs?)
>>
>> Yes, I would not say necessarily "small" retention, but probably
>> "limited" retention. Running on premise you can choose from HDFS or maybe
>> S3 compatible minio or some other distributed storage, depends on the scale
>> and deployment options (e.g. YARN or k8s).
>>
>> I also happen to work on a system which targets exactly these
>> streaming-batch workloads (persisting upserts from stream to batch for
>> reprocessing), see [1]. Please feel free to contact me directly if this
>> sounds interesting.
>>
>> Best,
>>
>>  Jan
>>
>> [1] https://github.com/O2-Czech-Republic/proxima-platform
>>
>> Thank you,
>> Gyorgy
>>
>> On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský  wrote:
>>
>>> Hi Gyorgy,
>>>
>>> I don't think it is possible to co-locate tasks as you describe it. Beam
>>> has no information about location of 'splits'. On the other hand, if batch
>>> throughput is the main concern, then reading from Kafka might not be the
>>> optimal choice. Although Kafka provides tiered storage for offloading
>>> historical data, it still somewhat limits scalability (and thus
>>> throughput), because the data have to be read by a broker and only then
>>> passed to a consumer. The parallelism is therefore limited by the number of
>>> Kafka partitions and not parallelism of the Flink job. A more scalable
>>> approach could be to persist data from Kafka to a batch storage (e.g. S3 or
>>> GCS) and reprocess it from there.
>>>
>>> Best,
>>>
>>>  Jan
>>

Re: beam using flink runner to achieve data locality in a distributed setup?

2024-07-02 Thread Jan Lukavský

Hi Gyorgy,

there is no concept of 'data locality' in Beam that would be analogous 
to how MapReduce used to work. The fact that tasks (compute) are 
co-located with storage on input is not transferred to Beam Flink 
pipelines. The whole concept is kind of ill defined in terms of Beam 
model, where tasks can be (at least in theory, depending on a runner) 
moved between workers in a distributed environment. The reason for this 
is that throughput (and cost) is dominated mostly by the ability to 
(uniformly) scale, not the costs associated with network transfers (this 
is actually most visible in the streaming case, where the data is 
already 'in motion'). The most common case in Beam is that compute is 
completely separated from storage (possible even in the extreme cases 
where streaming state is stored outside the compute of streaming 
pipeline - but cached locally). The resulting 'stateless' nature of 
workers generally enables easier and more flexible scaling.


Having said that, although Beam currently does not (AFAIK) try to 
leverage local reads, it _could_ be possible by a reasonable extension 
to how splittable DoFn [1] works so that it could make use of data 
locality. It would be non-trivial, though, and would definitely require 
support from the runner (Flink in this case).


My general suggestion would be to implement a prototype and measure 
throughput (and the part of it possibly related to networking) before 
attempting to dig deeper into how to implement this in Beam Flink.


Best,

 Jan

[1] https://beam.apache.org/blog/splittable-do-fn/

On 7/2/24 10:46, Balogh, György wrote:

Hi Jan,
Separating live and historic storage makes sense. I need a historic 
storage that can ensure data local processing using the beam - flink 
stack.
Can I surely achieve this with HDFS? I can colocate hdfs nodes with 
flink workers. What exactly enforces that flink nodes will read local 
and not remote data?

Thank you,
Gyorgy

On Mon, Jul 1, 2024 at 3:42 PM Jan Lukavský  wrote:

Hi Gyorgy,

comments inline.

On 7/1/24 15:10, Balogh, György wrote:

Hi Jan,
Let me add a few more details to show the full picture. We have
live datastreams (video analysis metadata) and we would like to
run both live and historic pipelines on the metadata (eg.: live
alerts, historic video searches).

This should be fine due to Beam's unified model. You can write a
PTransform that handles PCollection<...> without the need to worry
if the PCollection was created from Kafka or some bounded source.

We planned to use kafka to store the streaming data and
directly run both types of queries on top. You are suggesting to
consider having kafka with small retention to serve the live
queries and store the historic data somewhere else which scales
better for historic queries? We need to have on prem options
here. What options should we consider that scales nicely (in
terms of IO parallelization) with beam? (eg. hdfs?)


Yes, I would not say necessarily "small" retention, but probably
"limited" retention. Running on premise you can choose from HDFS
or maybe S3 compatible minio or some other distributed storage,
depends on the scale and deployment options (e.g. YARN or k8s).

I also happen to work on a system which targets exactly these
streaming-batch workloads (persisting upserts from stream to batch
for reprocessing), see [1]. Please feel free to contact me
directly if this sounds interesting.

Best,

 Jan

[1] https://github.com/O2-Czech-Republic/proxima-platform


Thank you,
Gyorgy

On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský  wrote:

Hi Gyorgy,

I don't think it is possible to co-locate tasks as you
describe it. Beam has no information about location of
'splits'. On the other hand, if batch throughput is the main
concern, then reading from Kafka might not be the optimal
choice. Although Kafka provides tiered storage for offloading
historical data, it still somewhat limits scalability (and
thus throughput), because the data have to be read by a
broker and only then passed to a consumer. The parallelism is
therefore limited by the number of Kafka partitions and not
parallelism of the Flink job. A more scalable approach could
be to persist data from Kafka to a batch storage (e.g. S3 or
GCS) and reprocess it from there.

Best,

 Jan

On 6/29/24 09:12, Balogh, György wrote:

Hi,
I'm planning a distributed system with multiple kafka
brokers co located with flink workers.
Data processing throughput for historic queries is a main
KPI. So I want to make sure all flink workers read local
data and not remote. I'm defining my pipelines in beam using
java.
Is it possible? What are the critical config elements to
achieve this?
Tha

Re: beam using flink runner to achieve data locality in a distributed setup?

2024-07-02 Thread Balogh , György
Hi Jan,
Separating live and historic storage makes sense. I need a historic storage
that can ensure data local processing using the beam - flink stack.
Can I surely achieve this with HDFS? I can colocate hdfs nodes with flink
workers. What exactly enforces that flink nodes will read local and not
remote data?
Thank you,
Gyorgy

On Mon, Jul 1, 2024 at 3:42 PM Jan Lukavský  wrote:

> Hi Gyorgy,
>
> comments inline.
> On 7/1/24 15:10, Balogh, György wrote:
>
> Hi Jan,
> Let me add a few more details to show the full picture. We have live
> datastreams (video analysis metadata) and we would like to run both live
> and historic pipelines on the metadata (eg.: live alerts, historic video
> searches).
>
> This should be fine due to Beam's unified model. You can write a
> PTransform that handles PCollection<...> without the need to worry if the
> PCollection was created from Kafka or some bounded source.
>
> We planned to use kafka to store the streaming data and directly run both
> types of queries on top. You are suggesting to consider having kafka with
> small retention to serve the live queries and store the historic data
> somewhere else which scales better for historic queries? We need to have on
> prem options here. What options should we consider that scales nicely (in
> terms of IO parallelization) with beam? (eg. hdfs?)
>
> Yes, I would not say necessarily "small" retention, but probably "limited"
> retention. Running on premise you can choose from HDFS or maybe S3
> compatible minio or some other distributed storage, depends on the scale
> and deployment options (e.g. YARN or k8s).
>
> I also happen to work on a system which targets exactly these
> streaming-batch workloads (persisting upserts from stream to batch for
> reprocessing), see [1]. Please feel free to contact me directly if this
> sounds interesting.
>
> Best,
>
>  Jan
>
> [1] https://github.com/O2-Czech-Republic/proxima-platform
>
> Thank you,
> Gyorgy
>
> On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský  wrote:
>
>> Hi Gyorgy,
>>
>> I don't think it is possible to co-locate tasks as you describe it. Beam
>> has no information about location of 'splits'. On the other hand, if batch
>> throughput is the main concern, then reading from Kafka might not be the
>> optimal choice. Although Kafka provides tiered storage for offloading
>> historical data, it still somewhat limits scalability (and thus
>> throughput), because the data have to be read by a broker and only then
>> passed to a consumer. The parallelism is therefore limited by the number of
>> Kafka partitions and not parallelism of the Flink job. A more scalable
>> approach could be to persist data from Kafka to a batch storage (e.g. S3 or
>> GCS) and reprocess it from there.
>>
>> Best,
>>
>>  Jan
>> On 6/29/24 09:12, Balogh, György wrote:
>>
>> Hi,
>> I'm planning a distributed system with multiple kafka brokers co located
>> with flink workers.
>> Data processing throughput for historic queries is a main KPI. So I want
>> to make sure all flink workers read local data and not remote. I'm defining
>> my pipelines in beam using java.
>> Is it possible? What are the critical config elements to achieve this?
>> Thank you,
>> Gyorgy
>>
>> --
>>
>> György Balogh
>> CTO
>> E gyorgy.bal...@ultinous.com 
>> M +36 30 270 8342 <+36%2030%20270%208342>
>> A HU, 1117 Budapest, Budafoki út 209.
>> W www.ultinous.com
>>
>>
>
> --
>
> György Balogh
> CTO
> E gyorgy.bal...@ultinous.com 
> M +36 30 270 8342 <+36%2030%20270%208342>
> A HU, 1117 Budapest, Budafoki út 209.
> W www.ultinous.com
>
>

-- 

György Balogh
CTO
E gyorgy.bal...@ultinous.com 
M +36 30 270 8342 <+36%2030%20270%208342>
A HU, 1117 Budapest, Budafoki út 209.
W www.ultinous.com


Re: beam using flink runner to achieve data locality in a distributed setup?

2024-07-01 Thread Jan Lukavský

Hi Gyorgy,

comments inline.

On 7/1/24 15:10, Balogh, György wrote:

Hi Jan,
Let me add a few more details to show the full picture. We have live 
datastreams (video analysis metadata) and we would like to run both 
live and historic pipelines on the metadata (eg.: live alerts, 
historic video searches).
This should be fine due to Beam's unified model. You can write a 
PTransform that handles PCollection<...> without the need to worry if 
the PCollection was created from Kafka or some bounded source.
We planned to use kafka to store the streaming data and directly run 
both types of queries on top. You are suggesting to consider 
having kafka with small retention to serve the live queries and store 
the historic data somewhere else which scales better for historic 
queries? We need to have on prem options here. What options should we 
consider that scales nicely (in terms of IO parallelization) with 
beam? (eg. hdfs?)


Yes, I would not say necessarily "small" retention, but probably 
"limited" retention. Running on premise you can choose from HDFS or 
maybe S3 compatible minio or some other distributed storage, depends on 
the scale and deployment options (e.g. YARN or k8s).


I also happen to work on a system which targets exactly these 
streaming-batch workloads (persisting upserts from stream to batch for 
reprocessing), see [1]. Please feel free to contact me directly if this 
sounds interesting.


Best,

 Jan

[1] https://github.com/O2-Czech-Republic/proxima-platform


Thank you,
Gyorgy

On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský  wrote:

Hi Gyorgy,

I don't think it is possible to co-locate tasks as you describe
it. Beam has no information about location of 'splits'. On the
other hand, if batch throughput is the main concern, then reading
from Kafka might not be the optimal choice. Although Kafka
provides tiered storage for offloading historical data, it still
somewhat limits scalability (and thus throughput), because the
data have to be read by a broker and only then passed to a
consumer. The parallelism is therefore limited by the number of
Kafka partitions and not parallelism of the Flink job. A more
scalable approach could be to persist data from Kafka to a batch
storage (e.g. S3 or GCS) and reprocess it from there.

Best,

 Jan

On 6/29/24 09:12, Balogh, György wrote:

Hi,
I'm planning a distributed system with multiple kafka brokers co
located with flink workers.
Data processing throughput for historic queries is a main KPI. So
I want to make sure all flink workers read local data and not
remote. I'm defining my pipelines in beam using java.
Is it possible? What are the critical config elements to achieve
this?
Thank you,
Gyorgy

-- 


György Balogh
CTO
E   gyorgy.bal...@ultinous.com 
M   +36 30 270 8342 
A   HU, 1117 Budapest, Budafoki út 209.
W   www.ultinous.com 




--

György Balogh
CTO
E   gyorgy.bal...@ultinous.com 
M   +36 30 270 8342 
A   HU, 1117 Budapest, Budafoki út 209.
W   www.ultinous.com 


Re: beam using flink runner to achieve data locality in a distributed setup?

2024-07-01 Thread Balogh , György
Hi Jan,
Let me add a few more details to show the full picture. We have live
datastreams (video analysis metadata) and we would like to run both live
and historic pipelines on the metadata (eg.: live alerts, historic video
searches). We planned to use kafka to store the streaming data and
directly run both types of queries on top. You are suggesting to consider
having kafka with small retention to serve the live queries and store the
historic data somewhere else which scales better for historic queries? We
need to have on prem options here. What options should we consider that
scales nicely (in terms of IO parallelization) with beam? (eg. hdfs?)
Thank you,
Gyorgy

On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský  wrote:

> Hi Gyorgy,
>
> I don't think it is possible to co-locate tasks as you describe it. Beam
> has no information about location of 'splits'. On the other hand, if batch
> throughput is the main concern, then reading from Kafka might not be the
> optimal choice. Although Kafka provides tiered storage for offloading
> historical data, it still somewhat limits scalability (and thus
> throughput), because the data have to be read by a broker and only then
> passed to a consumer. The parallelism is therefore limited by the number of
> Kafka partitions and not parallelism of the Flink job. A more scalable
> approach could be to persist data from Kafka to a batch storage (e.g. S3 or
> GCS) and reprocess it from there.
>
> Best,
>
>  Jan
> On 6/29/24 09:12, Balogh, György wrote:
>
> Hi,
> I'm planning a distributed system with multiple kafka brokers co located
> with flink workers.
> Data processing throughput for historic queries is a main KPI. So I want
> to make sure all flink workers read local data and not remote. I'm defining
> my pipelines in beam using java.
> Is it possible? What are the critical config elements to achieve this?
> Thank you,
> Gyorgy
>
> --
>
> György Balogh
> CTO
> E gyorgy.bal...@ultinous.com 
> M +36 30 270 8342 <+36%2030%20270%208342>
> A HU, 1117 Budapest, Budafoki út 209.
> W www.ultinous.com
>
>

-- 

György Balogh
CTO
E gyorgy.bal...@ultinous.com 
M +36 30 270 8342 <+36%2030%20270%208342>
A HU, 1117 Budapest, Budafoki út 209.
W www.ultinous.com


Re: beam using flink runner to achieve data locality in a distributed setup?

2024-07-01 Thread Jan Lukavský

Hi Gyorgy,

I don't think it is possible to co-locate tasks as you describe it. Beam 
has no information about location of 'splits'. On the other hand, if 
batch throughput is the main concern, then reading from Kafka might not 
be the optimal choice. Although Kafka provides tiered storage for 
offloading historical data, it still somewhat limits scalability (and 
thus throughput), because the data have to be read by a broker and only 
then passed to a consumer. The parallelism is therefore limited by the 
number of Kafka partitions and not parallelism of the Flink job. A more 
scalable approach could be to persist data from Kafka to a batch storage 
(e.g. S3 or GCS) and reprocess it from there.
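
To make the persist-and-reprocess suggestion concrete, here is a minimal Java sketch, assuming the beam-sdks-java-io-hadoop-file-system module is on the classpath. The namenode address and paths are placeholders, and as discussed in this thread it does not make the reads locality-aware on the Flink runner; it only moves the historic reads off the Kafka brokers.

import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class HdfsReprocessSketch {
  public static void main(String[] args) {
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

    // Register the HDFS cluster with Beam's Hadoop filesystem (placeholder address).
    Configuration hdfsConf = new Configuration();
    hdfsConf.set("fs.defaultFS", "hdfs://namenode:8020");
    options.setHdfsConfiguration(Collections.singletonList(hdfsConf));

    Pipeline p = Pipeline.create(options);
    // The same PTransforms used on the live Kafka stream can be applied to this
    // bounded PCollection of previously offloaded records.
    p.apply("ReadHistoric",
        TextIO.read().from("hdfs://namenode:8020/archive/events/*"));
    p.run().waitUntilFinish();
  }
}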


Best,

 Jan

On 6/29/24 09:12, Balogh, György wrote:

Hi,
I'm planning a distributed system with multiple kafka brokers co 
located with flink workers.
Data processing throughput for historic queries is a main KPI. So I 
want to make sure all flink workers read local data and not remote. 
I'm defining my pipelines in beam using java.

Is it possible? What are the critical config elements to achieve this?
Thank you,
Gyorgy

--

György Balogh
CTO
E   gyorgy.bal...@ultinous.com 
M   +36 30 270 8342 
A   HU, 1117 Budapest, Budafoki út 209.
W   www.ultinous.com 


Re: Exactly once KafkaIO with flink runner

2024-06-24 Thread Jan Lukavský
I don't use Kafka transactions, so I could only speculate. Seems that 
the transaction times out before being committed. Looking into the code, 
this could happen if there is *huge* amount of work between checkpoints 
(i.e. checkpoints do not happen often enough). I'll suggest 
investigating the logs looking for logs coming from the 
KafkaExactlyOnceSink.


 Jan

[1] 
https://github.com/apache/beam/blob/a944bf87cd03d32105d87fc986ecba5b656683bc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L245


On 6/24/24 16:35, Ruben Vargas wrote:

On Mon, Jun 24, 2024 at 2:02 AM Jan Lukavský  wrote:

Hi,

the distribution of keys to workers might not be uniform, when the
number of keys is comparable to total parallelism. General advice would be:

   a) try to increase number of keys (EOS parallelism in this case) to be
at least several times higher than parallelism

Makes sense; unfortunately I faced an error when I tried to put the
shards > partitions. :(

"message": 
"PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
ids -> ToGBKResult ->
PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
to Kafka topic 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
(4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
java.lang.RuntimeException:
java.lang.reflect.InvocationTargetException\n\tat

Do I need to move any configuration to do that?

Thanks


   b) increase maxParallelism (default 128, maximum 32768), as it might
influence the assignment of keys to downstream workers

Best,

   Jan

On 6/21/24 05:25, Ruben Vargas wrote:

Image was not correctly attached. Sending it again. Sorry

Thanks

On Thu, Jun 20, 2024 at 9:25 PM Ruben Vargas  wrote:

Hello guys, me again

I was trying to debug the issue with the  backpressure and I noticed
that even if I set the shards = 16, not all tasks are receiving
messages (attaching screenshot). You know potential causes and
solutions?

I really appreciate any help you can provide


Thank you very much!

Regards.


On Wed, Jun 19, 2024 at 11:09 PM Ruben Vargas  wrote:

Hello again

Thank you for all the suggestions.

Unfortunately if I put more shards than partitions it throws me this exception

"message": 
"PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
ids -> ToGBKResult ->
PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
to Kafka topic 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
(4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
java.lang.RuntimeException:
java.lang.reflect.InvocationTargetException\n\tat
..
..
..
org.apache.flink.runtime.taskmanager.Task.run(Task.java:568)\n\tat
java.base/java.lang.Thread.run(Thread.java:829)\nCaused by:
org.apache.kafka.common.errors.TimeoutException: Timeout expired after
6ms while awaiting AddOffsetsToTxn\n",


Any other alternative? Thank you very much!

Regards

On Wed, Jun 19, 2024 at 1:00 AM Jan Lukavský  wrote:

Hi,

regarding aligned vs unaligned checkpoints I recommend reading [1], it
explains it quite well. Generally, I would prefer unaligned checkpoints
in this case.

Another thing to consider is the number of shards of the EOS sink.
Because how the shards are distributed among workers, it might be good
idea to actually increase that to some number higher than number of
target partitions (e.g. targetPartitions * 10 or so). Additional thing
to consider is increasing maxParallelism of the pipeline (e.g. max value
is 32768), as it also affects how 'evenly' Flink assigns shards to
workers. You can check if the assignment is even using counters in the
sink operator(s).

Jan

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/

On 6/19/24 05:15, Ruben Vargas wrote:

Hello guys

Now I was able to pass that error.

I had to set the consumer factory function
.withConsumerFactoryFn(new KafkaConsumerFactory(config))

This is because my cluster uses SASL authentication mechanism, and the
small consumer created to fetch the topics metadata was throwing that
error.

There are other couple things I noticed:

- Now I have a lot of backpressure, I assigned x3 resources to the
cluster and even with that the back pressure is high . Any advice on
this? I already increased the shards to equal the number of partitions
of the destination topic.

- I have an error where
"State exists for shard mytopic-0, but there is no state stored with
Kafka topic mytopic' group id myconsumergroup'

The only way I found to recover from this error is to change the group
name. Any other advice on how to recover from this error?


Thank you very much for following this up!

On Tue, Jun 18, 2024 at 8:44 AM Ruben Va

Re: Exactly once KafkaIO with flink runner

2024-06-24 Thread Ruben Vargas
On Mon, Jun 24, 2024 at 2:02 AM Jan Lukavský  wrote:
>
> Hi,
>
> the distribution of keys to workers might not be uniform, when the
> number of keys is comparable to total parallelism. General advice would be:
>
>   a) try to increase number of keys (EOS parallelism in this case) to be
> at least several times higher than parallelism

Makes sense; unfortunately I faced an error when I tried to put the
shards > partitions. :(

"message": 
"PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
ids -> ToGBKResult ->
PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
to Kafka topic 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
(4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
java.lang.RuntimeException:
java.lang.reflect.InvocationTargetException\n\tat

Do I need to move any configuration to do that?

Thanks

>
>   b) increase maxParallelism (default 128, maximum 32768), as it might
> influence the assignment of keys to downstream workers
>
> Best,
>
>   Jan
>
> On 6/21/24 05:25, Ruben Vargas wrote:
> > Image was not correctly attached. Sending it again. Sorry
> >
> > Thanks
> >
> > On Thu, Jun 20, 2024 at 9:25 PM Ruben Vargas  
> > wrote:
> >> Hello guys, me again
> >>
> >> I was trying to debug the issue with the  backpressure and I noticed
> >> that even if I set the shards = 16, not all tasks are receiving
> >> messages (attaching screenshot). You know potential causes and
> >> solutions?
> >>
> >> I really appreciate any help you can provide
> >>
> >>
> >> Thank you very much!
> >>
> >> Regards.
> >>
> >>
> >> On Wed, Jun 19, 2024 at 11:09 PM Ruben Vargas  
> >> wrote:
> >>> Hello again
> >>>
> >>> Thank you for all the suggestions.
> >>>
> >>> Unfortunately if I put more shards than partitions it throws me this 
> >>> exception
> >>>
> >>> "message": 
> >>> "PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
> >>> ids -> ToGBKResult ->
> >>> PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
> >>> to Kafka topic 
> >>> 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
> >>> (4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
> >>> FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
> >>> java.lang.RuntimeException:
> >>> java.lang.reflect.InvocationTargetException\n\tat
> >>> ..
> >>> ..
> >>> ..
> >>> org.apache.flink.runtime.taskmanager.Task.run(Task.java:568)\n\tat
> >>> java.base/java.lang.Thread.run(Thread.java:829)\nCaused by:
> >>> org.apache.kafka.common.errors.TimeoutException: Timeout expired after
> >>> 6ms while awaiting AddOffsetsToTxn\n",
> >>>
> >>>
> >>> Any other alternative? Thank you very much!
> >>>
> >>> Regards
> >>>
> >>> On Wed, Jun 19, 2024 at 1:00 AM Jan Lukavský  wrote:
>  Hi,
> 
>  regarding aligned vs unaligned checkpoints I recommend reading [1], it
>  explains it quite well. Generally, I would prefer unaligned checkpoints
>  in this case.
> 
>  Another thing to consider is the number of shards of the EOS sink.
>  Because how the shards are distributed among workers, it might be good
>  idea to actually increase that to some number higher than number of
>  target partitions (e.g. targetPartitions * 10 or so). Additional thing
>  to consider is increasing maxParallelism of the pipeline (e.g. max value
>  is 32768), as it also affects how 'evenly' Flink assigns shards to
>  workers. You can check if the assignment is even using counters in the
>  sink operator(s).
> 
> Jan
> 
>  [1]
>  https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/
> 
>  On 6/19/24 05:15, Ruben Vargas wrote:
> > Hello guys
> >
> > Now I was able to pass that error.
> >
> > I had to set the consumer factory function
> > .withConsumerFactoryFn(new KafkaConsumerFactory(config))
> >
> > This is because my cluster uses SASL authentication mechanism, and the
> > small consumer created to fetch the topics metadata was throwing that
> > error.
> >
> > There are other couple things I noticed:
> >
> >- Now I have a lot of backpressure, I assigned x3 resources to the
> > cluster and even with that the back pressure is high . Any advice on
> > this? I already increased the shards to equal the number of partitions
> > of the destination topic.
> >
> > - I have an error where
> > "State exists for shard mytopic-0, but there is no state stored with
> > Kafka topic mytopic' group id myconsumergroup'
> >
> > The only way I found to recover from this error is to change the group
> > name. Any other advice on how to recover from this error?
> >
> >
> > Thank you very mu

Re: Exactly once KafkaIO with flink runner

2024-06-24 Thread Jan Lukavský

Hi,

the distribution of keys to workers might not be uniform, when the 
number of keys is comparable to total parallelism. General advice would be:


 a) try to increase number of keys (EOS parallelism in this case) to be 
at least several times higher than parallelism


 b) increase maxParallelism (default 128, maximum 32768), as it might 
influence the assignment of keys to downstream workers
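
As a concrete illustration of a) and b), a hedged sketch of an exactly-once sink configured with more shards than target partitions. The brokers, topic, group id, and shard count are placeholders; withEOS takes the number of shards (keys) and a sink group id.

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringSerializer;

public class EosSinkSketch {
  // Uses 160 shards for a 16-partition target topic (roughly 10x), so the EOS
  // keys spread more evenly across workers; all values below are placeholders.
  static void writeExactlyOnce(PCollection<KV<String, String>> records) {
    records.apply("WriteEOS",
        KafkaIO.<String, String>write()
            .withBootstrapServers("broker1:9092,broker2:9092")
            .withTopic("output-topic")
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class)
            .withEOS(160, "my-eos-sink-group"));
    // The Flink runner's maxParallelism option (up to 32768) also influences
    // how these shards are assigned to workers, per b) above.
  }
}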


Best,

 Jan

On 6/21/24 05:25, Ruben Vargas wrote:

Image was not correctly attached. Sending it again. Sorry

Thanks

On Thu, Jun 20, 2024 at 9:25 PM Ruben Vargas  wrote:

Hello guys, me again

I was trying to debug the issue with the  backpressure and I noticed
that even if I set the shards = 16, not all tasks are receiving
messages (attaching screenshot). You know potential causes and
solutions?

I really appreciate any help you can provide


Thank you very much!

Regards.


On Wed, Jun 19, 2024 at 11:09 PM Ruben Vargas  wrote:

Hello again

Thank you for all the suggestions.

Unfortunately if I put more shards than partitions it throws me this exception

"message": 
"PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
ids -> ToGBKResult ->
PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
to Kafka topic 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
(4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
java.lang.RuntimeException:
java.lang.reflect.InvocationTargetException\n\tat
..
..
..
org.apache.flink.runtime.taskmanager.Task.run(Task.java:568)\n\tat
java.base/java.lang.Thread.run(Thread.java:829)\nCaused by:
org.apache.kafka.common.errors.TimeoutException: Timeout expired after
6ms while awaiting AddOffsetsToTxn\n",


Any other alternative? Thank you very much!

Regards

On Wed, Jun 19, 2024 at 1:00 AM Jan Lukavský  wrote:

Hi,

regarding aligned vs unaligned checkpoints I recommend reading [1], it
explains it quite well. Generally, I would prefer unaligned checkpoints
in this case.

Another thing to consider is the number of shards of the EOS sink.
Because how the shards are distributed among workers, it might be good
idea to actually increase that to some number higher than number of
target partitions (e.g. targetPartitions * 10 or so). Additional thing
to consider is increasing maxParallelism of the pipeline (e.g. max value
is 32768), as it also affects how 'evenly' Flink assigns shards to
workers. You can check if the assignment is even using counters in the
sink operator(s).

   Jan

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/

On 6/19/24 05:15, Ruben Vargas wrote:

Hello guys

Now I was able to pass that error.

I had to set the consumer factory function
.withConsumerFactoryFn(new KafkaConsumerFactory(config))

This is because my cluster uses SASL authentication mechanism, and the
small consumer created to fetch the topics metadata was throwing that
error.

There are other couple things I noticed:

   - Now I have a lot of backpressure, I assigned x3 resources to the
cluster and even with that the back pressure is high . Any advice on
this? I already increased the shards to equal the number of partitions
of the destination topic.

- I have an error where
"State exists for shard mytopic-0, but there is no state stored with
Kafka topic mytopic' group id myconsumergroup'

The only way I found to recover from this error is to change the group
name. Any other advice on how to recover from this error?


Thank you very much for following this up!

On Tue, Jun 18, 2024 at 8:44 AM Ruben Vargas  wrote:

Hello Jan

Thanks for the suggestions

Any benefit of using aligned vs unaligned?


In the end I found one problem that was preventing Flink from doing
the checkpointing. It was a DoFn that had some "non
serializable" objects, so I made those transient and initialized them
in the setup.
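
A hedged sketch of that fix; the HTTP client below stands in for whatever non-serializable dependency the DoFn holds. The field is transient and is created in @Setup on each worker instead of being serialized with the DoFn.

import java.net.http.HttpClient;
import org.apache.beam.sdk.transforms.DoFn;

class EnrichFn extends DoFn<String, String> {
  private transient HttpClient client; // not serialized with the DoFn

  @Setup
  public void setup() {
    // Recreated on every worker after deserialization, so the DoFn itself
    // stays serializable and checkpointing is not blocked.
    client = HttpClient.newHttpClient();
  }

  @ProcessElement
  public void process(@Element String element, OutputReceiver<String> out) {
    out.output(element); // placeholder for the actual work that uses 'client'
  }
}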

Weird, because I usually was able to detect these kinds of errors just
running in the direct runner, or even in flink before enabling EOS.


Now I'm facing another weird issue

org.apache.beam.sdk.util.UserCodeException:
org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms
expired before the last committed offset for partitions
[behavioral-signals-6] could be determined. Try tuning
default.api.timeout.ms larger to relax the threshold.
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
at 
org.apache.beam.sdk.io.kafka.KafkaExactlyOnceSink$ExactlyOnceWriter$DoFnInvoker.invokeProcessElement(Unknown
Source)
at 
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:212)

I tried to extend the timeout and it didn't work, my shards are equal
to my number of partitions.

I appreciate any kind of guidance

Thanks.

On Tue, Jun 18, 2024 at 5:56 AM Jan Lukavský  wrote:

I'd suggest:
   a) use unaligned check

Re: Exactly once KafkaIO with flink runner

2024-06-20 Thread Ruben Vargas
Image was not correctly attached. Sending it again. Sorry

Thanks

On Thu, Jun 20, 2024 at 9:25 PM Ruben Vargas  wrote:
>
> Hello guys, me again
>
> I was trying to debug the issue with the  backpressure and I noticed
> that even if I set the shards = 16, not all tasks are receiving
> messages (attaching screenshot). You know potential causes and
> solutions?
>
> I really appreciate any help you can provide
>
>
> Thank you very much!
>
> Regards.
>
>
> On Wed, Jun 19, 2024 at 11:09 PM Ruben Vargas  wrote:
> >
> > Hello again
> >
> > Thank you for all the suggestions.
> >
> > Unfortunately if I put more shards than partitions it throws me this 
> > exception
> >
> > "message": 
> > "PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
> > ids -> ToGBKResult ->
> > PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
> > to Kafka topic 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
> > (4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
> > FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
> > java.lang.RuntimeException:
> > java.lang.reflect.InvocationTargetException\n\tat
> > ..
> > ..
> > ..
> > org.apache.flink.runtime.taskmanager.Task.run(Task.java:568)\n\tat
> > java.base/java.lang.Thread.run(Thread.java:829)\nCaused by:
> > org.apache.kafka.common.errors.TimeoutException: Timeout expired after
> > 6ms while awaiting AddOffsetsToTxn\n",
> >
> >
> > Any other alternative? Thank you very much!
> >
> > Regards
> >
> > On Wed, Jun 19, 2024 at 1:00 AM Jan Lukavský  wrote:
> > >
> > > Hi,
> > >
> > > regarding aligned vs unaligned checkpoints I recommend reading [1], it
> > > explains it quite well. Generally, I would prefer unaligned checkpoints
> > > in this case.
> > >
> > > Another thing to consider is the number of shards of the EOS sink.
> > > Because how the shards are distributed among workers, it might be good
> > > idea to actually increase that to some number higher than number of
> > > target partitions (e.g. targetPartitions * 10 or so). Additional thing
> > > to consider is increasing maxParallelism of the pipeline (e.g. max value
> > > is 32768), as it also affects how 'evenly' Flink assigns shards to
> > > workers. You can check if the assignment is even using counters in the
> > > sink operator(s).
> > >
> > >   Jan
> > >
> > > [1]
> > > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/
> > >
> > > On 6/19/24 05:15, Ruben Vargas wrote:
> > > > Hello guys
> > > >
> > > > Now I was able to pass that error.
> > > >
> > > > I had to set the consumer factory function
> > > > .withConsumerFactoryFn(new KafkaConsumerFactory(config))
> > > >
> > > > This is because my cluster uses SASL authentication mechanism, and the
> > > > small consumer created to fetch the topics metadata was throwing that
> > > > error.
> > > >
> > > > There are other couple things I noticed:
> > > >
> > > >   - Now I have a lot of backpressure, I assigned x3 resources to the
> > > > cluster and even with that the back pressure is high . Any advice on
> > > > this? I already increased the shards to equal the number of partitions
> > > > of the destination topic.
> > > >
> > > > - I have an error where
> > > > "State exists for shard mytopic-0, but there is no state stored with
> > > > Kafka topic mytopic' group id myconsumergroup'
> > > >
> > > > The only way I found to recover from this error is to change the group
> > > > name. Any other advice on how to recover from this error?
> > > >
> > > >
> > > > Thank you very much for following this up!
> > > >
> > > > On Tue, Jun 18, 2024 at 8:44 AM Ruben Vargas  
> > > > wrote:
> > > >> Hello Jan
> > > >>
> > > >> Thanks for the suggestions
> > > >>
> > > >> Any benefit of using aligned vs unaligned?
> > > >>
> > > >>
> > > >> At the end I found one problem that was preventing  flink from doing
> > > >> the checkpointing. It was a DoFn function that has some "non
> > > >> serializable" objects, so I made those transient and initialized those
> > > >> on the setup.
> > > >>
> > > >> Weird, because I usually was able to detect these kinds of errors just
> > > >> running in the direct runner, or even in flink before enabling EOS.
> > > >>
> > > >>
> > > >> Now I'm facing another weird issue
> > > >>
> > > >> org.apache.beam.sdk.util.UserCodeException:
> > > >> org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms
> > > >> expired before the last committed offset for partitions
> > > >> [behavioral-signals-6] could be determined. Try tuning
> > > >> default.api.timeout.ms larger to relax the threshold.
> > > >> at 
> > > >> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
> > > >> at 
> > > >> org.apache.beam.sdk.io.kafka.KafkaExactlyOnceSink$ExactlyOnceWriter$DoFnInvoker.invokeProcessElement(Unknown
> > > >> Source)
> > > >> at 
> > > >> org.apache.beam.runners.c

Re: Exactly once KafkaIO with flink runner

2024-06-20 Thread Ruben Vargas
Hello guys, me again

I was trying to debug the issue with the  backpressure and I noticed
that even if I set the shards = 16, not all tasks are receiving
messages (attaching screenshot). You know potential causes and
solutions?

I really appreciate any help you can provide


Thank you very much!

Regards.


On Wed, Jun 19, 2024 at 11:09 PM Ruben Vargas  wrote:
>
> Hello again
>
> Thank you for all the suggestions.
>
> Unfortunately if I put more shards than partitions it throws me this exception
>
> "message": 
> "PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
> ids -> ToGBKResult ->
> PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
> to Kafka topic 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
> (4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
> FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
> java.lang.RuntimeException:
> java.lang.reflect.InvocationTargetException\n\tat
> ..
> ..
> ..
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:568)\n\tat
> java.base/java.lang.Thread.run(Thread.java:829)\nCaused by:
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after
> 6ms while awaiting AddOffsetsToTxn\n",
>
>
> Any other alternative? Thank you very much!
>
> Regards
>
> On Wed, Jun 19, 2024 at 1:00 AM Jan Lukavský  wrote:
> >
> > Hi,
> >
> > regarding aligned vs unaligned checkpoints I recommend reading [1], it
> > explains it quite well. Generally, I would prefer unaligned checkpoints
> > in this case.
> >
> > Another thing to consider is the number of shards of the EOS sink.
> > Because how the shards are distributed among workers, it might be good
> > idea to actually increase that to some number higher than number of
> > target partitions (e.g. targetPartitions * 10 or so). Additional thing
> > to consider is increasing maxParallelism of the pipeline (e.g. max value
> > is 32768), as it also affects how 'evenly' Flink assigns shards to
> > workers. You can check if the assignment is even using counters in the
> > sink operator(s).
> >
> >   Jan
> >
> > [1]
> > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/
> >
> > On 6/19/24 05:15, Ruben Vargas wrote:
> > > Hello guys
> > >
> > > Now I was able to pass that error.
> > >
> > > I had to set the consumer factory function
> > > .withConsumerFactoryFn(new KafkaConsumerFactory(config))
> > >
> > > This is because my cluster uses SASL authentication mechanism, and the
> > > small consumer created to fetch the topics metadata was throwing that
> > > error.
> > >
> > > There are other couple things I noticed:
> > >
> > >   - Now I have a lot of backpressure, I assigned x3 resources to the
> > > cluster and even with that the back pressure is high . Any advice on
> > > this? I already increased the shards to equal the number of partitions
> > > of the destination topic.
> > >
> > > - I have an error where
> > > "State exists for shard mytopic-0, but there is no state stored with
> > > Kafka topic mytopic' group id myconsumergroup'
> > >
> > > The only way I found to recover from this error is to change the group
> > > name. Any other advice on how to recover from this error?
> > >
> > >
> > > Thank you very much for following this up!
> > >
> > > On Tue, Jun 18, 2024 at 8:44 AM Ruben Vargas  
> > > wrote:
> > >> Hello Jan
> > >>
> > >> Thanks for the suggestions
> > >>
> > >> Any benefit of using aligned vs unaligned?
> > >>
> > >>
> > >> At the end I found one problem that was preventing  flink from doing
> > >> the checkpointing. It was a DoFn function that has some "non
> > >> serializable" objects, so I made those transient and initialized those
> > >> on the setup.
> > >>
> > >> Weird, because I usually was able to detect these kinds of errors just
> > >> running in the direct runner, or even in flink before enabling EOS.
> > >>
> > >>
> > >> Now I'm facing another weird issue
> > >>
> > >> org.apache.beam.sdk.util.UserCodeException:
> > >> org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms
> > >> expired before the last committed offset for partitions
> > >> [behavioral-signals-6] could be determined. Try tuning
> > >> default.api.timeout.ms larger to relax the threshold.
> > >> at 
> > >> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
> > >> at 
> > >> org.apache.beam.sdk.io.kafka.KafkaExactlyOnceSink$ExactlyOnceWriter$DoFnInvoker.invokeProcessElement(Unknown
> > >> Source)
> > >> at 
> > >> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:212)
> > >>
> > >> I tried to extend the timeout and it didn't work, my shards are equal
> > >> to my number of partitions.
> > >>
> > >> I appreciate any kind of guidance
> > >>
> > >> Thanks.
> > >>
> > >> On Tue, Jun 18, 2024 at 5:56 AM Jan Lukavský  wrote:
> > >>> I'd suggest:
> > >>>   a) use unaligned ch

Re: Parallelism of a side input

2024-06-20 Thread Ruben Vargas
The only bad thing about this approach is that, at least in the Flink runner, it
consumes a task slot :(

El El mié, 12 de jun de 2024 a la(s) 9:38 a.m., Robert Bradshaw <
rober...@google.com> escribió:

> On Wed, Jun 12, 2024 at 7:56 AM Ruben Vargas 
> wrote:
> >
> > The approach looks good. but one question
> >
> > My understanding is that this will schedule for example 8 operators
> across the workers, but only one of them will be processing, the others
> remain idle? Are those consuming resources in some way? I'm assuming may be
> is not significant.
>
> That is correct, but the resources consumed by an idle operator should
> be negligible.
>
> > Thanks.
> >
> > El El vie, 7 de jun de 2024 a la(s) 3:56 p.m., Robert Bradshaw via user <
> user@beam.apache.org> escribió:
> >>
> >> You can always limit the parallelism by assigning a single key to
> >> every element and then doing a grouping or reshuffle[1] on that key
> >> before processing the elements. Even if the operator parallelism for
> >> that step is technically, say, eight, your effective parallelism will
> >> be exactly one.
> >>
> >> [1]
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html
> >>
> >> On Fri, Jun 7, 2024 at 2:13 PM Ruben Vargas 
> wrote:
> >> >
> >> > Hello guys
> >> >
> >> > One question, I have a side input which fetches an endpoint each 30
> >> > min, I pretty much copied the example here:
> >> > https://beam.apache.org/documentation/patterns/side-inputs/ but added
> >> > some logic to fetch the endpoint and parse the payload.
> >> >
> >> > My question is: it is possible to control the parallelism of this
> >> > single ParDo that does the fetch/transform? I don't think I need a lot
> >> > of parallelism for that one. I'm currently using flink runner and I
> >> > see the parallelism is 8 (which is the general parallelism for my
> >> > flink cluster).
> >> >
> >> > Is it possible to set it to 1 for example?
> >> >
> >> >
> >> > Regards.
>


Re: Exactly once KafkaIO with flink runner

2024-06-19 Thread Ruben Vargas
Hello again

Thank you for all the suggestions.

Unfortunately if I put more shards than partitions it throws me this exception

"message": 
"PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Persist
ids -> ToGBKResult ->
PipelineBuilder-debug-output/KafkaIO.Write/KafkaIO.WriteRecords/KafkaExactlyOnceSink/Write
to Kafka topic 'behavioral-signals-log-stream'/ParMultiDo(ExactlyOnceWriter)
(4/8)#0 (76ed5be34c202de19384b829f09d6346) switched from RUNNING to
FAILED with failure cause: org.apache.beam.sdk.util.UserCodeException:
java.lang.RuntimeException:
java.lang.reflect.InvocationTargetException\n\tat
..
..
..
org.apache.flink.runtime.taskmanager.Task.run(Task.java:568)\n\tat
java.base/java.lang.Thread.run(Thread.java:829)\nCaused by:
org.apache.kafka.common.errors.TimeoutException: Timeout expired after
6ms while awaiting AddOffsetsToTxn\n",


Any other alternative? Thank you very much!

Regards

On Wed, Jun 19, 2024 at 1:00 AM Jan Lukavský  wrote:
>
> Hi,
>
> regarding aligned vs unaligned checkpoints I recommend reading [1], it
> explains it quite well. Generally, I would prefer unaligned checkpoints
> in this case.
>
> Another thing to consider is the number of shards of the EOS sink.
> Because how the shards are distributed among workers, it might be good
> idea to actually increase that to some number higher than number of
> target partitions (e.g. targetPartitions * 10 or so). Additional thing
> to consider is increasing maxParallelism of the pipeline (e.g. max value
> is 32768), as it also affects how 'evenly' Flink assigns shards to
> workers. You can check if the assignment is even using counters in the
> sink operator(s).
>
>   Jan
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/
>
> On 6/19/24 05:15, Ruben Vargas wrote:
> > Hello guys
> >
> > Now I was able to pass that error.
> >
> > I had to set the consumer factory function
> > .withConsumerFactoryFn(new KafkaConsumerFactory(config))
> >
> > This is because my cluster uses SASL authentication mechanism, and the
> > small consumer created to fetch the topics metadata was throwing that
> > error.
> >
> > There are other couple things I noticed:
> >
> >   - Now I have a lot of backpressure, I assigned x3 resources to the
> > cluster and even with that the back pressure is high . Any advice on
> > this? I already increased the shards to equal the number of partitions
> > of the destination topic.
> >
> > - I have an error where
> > "State exists for shard mytopic-0, but there is no state stored with
> > Kafka topic mytopic' group id myconsumergroup'
> >
> > The only way I found to recover from this error is to change the group
> > name. Any other advice on how to recover from this error?
> >
> >
> > Thank you very much for following this up!
> >
> > On Tue, Jun 18, 2024 at 8:44 AM Ruben Vargas  
> > wrote:
> >> Hello Jan
> >>
> >> Thanks for the suggestions
> >>
> >> Any benefit of using aligned vs unaligned?
> >>
> >>
> >> At the end I found one problem that was preventing  flink from doing
> >> the checkpointing. It was a DoFn function that has some "non
> >> serializable" objects, so I made those transient and initialized those
> >> on the setup.
> >>
> >> Weird, because I usually was able to detect these kinds of errors just
> >> running in the direct runner, or even in flink before enabling EOS.
> >>
> >>
> >> Now I'm facing another weird issue
> >>
> >> org.apache.beam.sdk.util.UserCodeException:
> >> org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms
> >> expired before the last committed offset for partitions
> >> [behavioral-signals-6] could be determined. Try tuning
> >> default.api.timeout.ms larger to relax the threshold.
> >> at 
> >> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
> >> at 
> >> org.apache.beam.sdk.io.kafka.KafkaExactlyOnceSink$ExactlyOnceWriter$DoFnInvoker.invokeProcessElement(Unknown
> >> Source)
> >> at 
> >> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:212)
> >>
> >> I tried to extend the timeout and it didn't work, my shards are equal
> >> to my number of partitions.
> >>
> >> I appreciate any kind of guidance
> >>
> >> Thanks.
> >>
> >> On Tue, Jun 18, 2024 at 5:56 AM Jan Lukavský  wrote:
> >>> I'd suggest:
> >>>   a) use unaligned checkpoints, if possible
> >>>
> >>>   b) verify the number of buckets you use for EOS sink, this limits 
> >>> parallelism [1].
> >>>
> >>> Best,
> >>>
> >>>   Jan
> >>>
> >>> [1] 
> >>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.WriteRecords.html#withEOS-int-java.lang.String-
> >>>
> >>> On 6/18/24 09:32, Ruben Vargas wrote:
> >>>
> >>> Hello Lukavsky
> >>>
> >>> Thanks for your reply !
> >>>
> >>> I thought was due backpreassure but i increased the resources of the 
> >>> cluster and problem still presist. More that that, data stop flowing and 
> >>> th

Re: KafkaIO metric publishing

2024-06-19 Thread XQ Hu via user
Is your job a Dataflow Template job?

The error is caused by
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DataflowTemplateJob.java#L55
.

So basically DataflowTemplateJob does not support metrics.
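
For completeness, on a regular (non-template) job the metrics can be filtered at query time after submission. A hedged sketch follows; the namespace string is only a placeholder, not the namespace KafkaIO actually reports, so inspect what your job emits before filtering on it.

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

public class FilteredMetricsReport {
  static void report(PipelineResult result) {
    result.waitUntilFinish();
    // Query only the metrics in one namespace instead of everything the job reports.
    MetricQueryResults filtered = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.inNamespace("MyPipelineMetrics"))
            .build());
    for (MetricResult<Long> counter : filtered.getCounters()) {
      System.out.println(counter.getName() + " = " + counter.getAttempted());
    }
  }
}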


On Wed, Jun 19, 2024 at 3:57 AM Lahiru Ginnaliya Gamathige <
glah...@gmail.com> wrote:

> Hi Users,
>
> In Google Cloud monitoring there is a limit of 100 metrics and when we are
> using KafkaIO, the library publishes a bunch of metrics per topic. With our
> use we will easily run out of 100 metric limit.
>
> We want to stop KafkaIO from publishing metrics and I do not see this is
> configurable. So I am trying to write a metric filtering logic (we are
> using beam version 2.55.1).
> I wrote a Sink but when I try to find a way to register the sink I cannot
> see a way to do the following in this beam version,
>
> *MetricsEnvironment.setMetricsSink(new
> CustomMetricsSink(options.getProject()));*
>
> Then I tried to register it like this,
>
> PipelineResult results = run(options);
> results.waitUntilFinish();
>
>
>
> *   MetricQueryResults metricQueryResults =
> results.metrics().queryMetrics(MetricsFilter.builder().build());
> CustomMetricSink reporter = new CustomMetricSink(options.getProject());
> reporter.writeMetrics(metricQueryResults);*
>
> With the above code pipeline is failing to start with the 
> error(java.lang.UnsupportedOperationException:
> The result of template creation should not be used.)
>
>
> Do you suggest another solution for this problem (it sounds like a quite
> common problem when using kafkaio). Or do you have any suggestion about my
> attempts ?
>
> Regards
> Lahiru
>
>


Re: Exactly once KafkaIO with flink runner

2024-06-19 Thread Jan Lukavský

Hi,

regarding aligned vs unaligned checkpoints I recommend reading [1]; it 
explains it quite well. Generally, I would prefer unaligned checkpoints 
in this case.


Another thing to consider is the number of shards of the EOS sink. 
Because of how the shards are distributed among workers, it might be a good 
idea to increase that to some number higher than the number of 
target partitions (e.g. targetPartitions * 10 or so). Another thing 
to consider is increasing the maxParallelism of the pipeline (e.g. the max value 
is 32768), as it also affects how 'evenly' Flink assigns shards to 
workers. You can check whether the assignment is even using counters in the 
sink operator(s).


 Jan

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/


On 6/19/24 05:15, Ruben Vargas wrote:

Hello guys

Now I was able to pass that error.

I had to set the consumer factory function
.withConsumerFactoryFn(new KafkaConsumerFactory(config))

This is because my cluster uses SASL authentication mechanism, and the
small consumer created to fetch the topics metadata was throwing that
error.

There are other couple things I noticed:

  - Now I have a lot of backpressure, I assigned x3 resources to the
cluster and even with that the back pressure is high . Any advice on
this? I already increased the shards to equal the number of partitions
of the destination topic.

- I have an error where
"State exists for shard mytopic-0, but there is no state stored with
Kafka topic mytopic' group id myconsumergroup'

The only way I found to recover from this error is to change the group
name. Any other advice on how to recover from this error?


Thank you very much for following this up!

On Tue, Jun 18, 2024 at 8:44 AM Ruben Vargas  wrote:

Hello Jan

Thanks for the suggestions

Any benefit of using aligned vs unaligned?


At the end I found one problem that was preventing  flink from doing
the checkpointing. It was a DoFn function that has some "non
serializable" objects, so I made those transient and initialized those
on the setup.

Weird, because I usually was able to detect these kinds of errors just
running in the direct runner, or even in flink before enabling EOS.


Now I'm facing another weird issue

org.apache.beam.sdk.util.UserCodeException:
org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms
expired before the last committed offset for partitions
[behavioral-signals-6] could be determined. Try tuning
default.api.timeout.ms larger to relax the threshold.
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
at 
org.apache.beam.sdk.io.kafka.KafkaExactlyOnceSink$ExactlyOnceWriter$DoFnInvoker.invokeProcessElement(Unknown
Source)
at 
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:212)

I tried to extend the timeout and it didn't work, my shards are equal
to my number of partitions.

I appreciate any kind of guidance

Thanks.

On Tue, Jun 18, 2024 at 5:56 AM Jan Lukavský  wrote:

I'd suggest:
  a) use unaligned checkpoints, if possible

  b) verify the number of buckets you use for EOS sink, this limits parallelism 
[1].

Best,

  Jan

[1] 
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.WriteRecords.html#withEOS-int-java.lang.String-

On 6/18/24 09:32, Ruben Vargas wrote:

Hello Lukavsky

Thanks for your reply !

I thought was due backpreassure but i increased the resources of the cluster 
and problem still presist. More that that, data stop flowing and the checkpoint 
still fail.

I have configured the checkpoint to do it per minute. The timeout is 1h. Is 
aligned checkpoint.

El El mar, 18 de jun de 2024 a la(s) 1:14 a.m., Jan Lukavský  
escribió:

H Ruben,

from the provided screenshot it seems to me, that the pipeline in
backpressured by the sink. Can you please share your checkpoint
configuration? Are you using unaligned checkpoints? What is the
checkpointing interval and the volume of data coming in from the source?
With EOS data is committed after checkpoint, before that, the data is
buffered in state, which makes the sink more resource intensive.

   Jan

On 6/18/24 05:30, Ruben Vargas wrote:

Attached a better image of the console.

Thanks!

On Mon, Jun 17, 2024 at 9:28 PM Ruben Vargas  wrote:

Hello guys

Wondering if some of you have experiences enabling Exactly Once in
KafkaIO with Flink runner? I enabled it and now I'm facing an issue
where all the checkpoints are failing. I cannot see any exception on
the logs.

Flink console only mentions this "Asynchronous task checkpoint
failed." I also noticed that some operators don't acknowledge the
checkpointing  (Attached a screenshot).

I did this:

1) KafkaIO.Read:

update consumer properties with enable.auto.commit = false
.withReadCommitted()
.commitOffsetsInFinalize()

2) KafkaIO#write:

.withEOS(numShards, sinkGroupId)

But my application is not able to deliver messages to the output topic
due the checkpoint failing.

Re: Exactly once KafkaIO with flink runner

2024-06-18 Thread Ruben Vargas
Hello guys

Now I was able to pass that error.

I had to set the consumer factory function
.withConsumerFactoryFn(new KafkaConsumerFactory(config))

This is because my cluster uses the SASL authentication mechanism, and the
small consumer created to fetch the topic metadata was throwing that
error.

There are a couple of other things I noticed:

 - Now I have a lot of backpressure. I assigned 3x resources to the
cluster and even with that the backpressure is high. Any advice on
this? I already increased the shards to equal the number of partitions
of the destination topic.

- I have an error where
"State exists for shard mytopic-0, but there is no state stored with
Kafka topic mytopic' group id myconsumergroup'

The only way I found to recover from this error is to change the group
name. Any other advice on how to recover from this error?


Thank you very much for following this up!

On Tue, Jun 18, 2024 at 8:44 AM Ruben Vargas  wrote:
>
> Hello Jan
>
> Thanks for the suggestions
>
> Any benefit of using aligned vs unaligned?
>
>
> At the end I found one problem that was preventing  flink from doing
> the checkpointing. It was a DoFn function that has some "non
> serializable" objects, so I made those transient and initialized those
> on the setup.
>
> Weird, because I usually was able to detect these kinds of errors just
> running in the direct runner, or even in flink before enabling EOS.
>
>
> Now I'm facing another weird issue
>
> org.apache.beam.sdk.util.UserCodeException:
> org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms
> expired before the last committed offset for partitions
> [behavioral-signals-6] could be determined. Try tuning
> default.api.timeout.ms larger to relax the threshold.
> at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
> at 
> org.apache.beam.sdk.io.kafka.KafkaExactlyOnceSink$ExactlyOnceWriter$DoFnInvoker.invokeProcessElement(Unknown
> Source)
> at 
> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:212)
>
> I tried to extend the timeout and it didn't work, my shards are equal
> to my number of partitions.
>
> I appreciate any kind of guidance
>
> Thanks.
>
> On Tue, Jun 18, 2024 at 5:56 AM Jan Lukavský  wrote:
> >
> > I'd suggest:
> >  a) use unaligned checkpoints, if possible
> >
> >  b) verify the number of buckets you use for EOS sink, this limits 
> > parallelism [1].
> >
> > Best,
> >
> >  Jan
> >
> > [1] 
> > https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.WriteRecords.html#withEOS-int-java.lang.String-
> >
> > On 6/18/24 09:32, Ruben Vargas wrote:
> >
> > Hello Lukavsky
> >
> > Thanks for your reply !
> >
> > I thought was due backpreassure but i increased the resources of the 
> > cluster and problem still presist. More that that, data stop flowing and 
> > the checkpoint still fail.
> >
> > I have configured the checkpoint to do it per minute. The timeout is 1h. Is 
> > aligned checkpoint.
> >
> > El El mar, 18 de jun de 2024 a la(s) 1:14 a.m., Jan Lukavský 
> >  escribió:
> >>
> >> H Ruben,
> >>
> >> from the provided screenshot it seems to me, that the pipeline in
> >> backpressured by the sink. Can you please share your checkpoint
> >> configuration? Are you using unaligned checkpoints? What is the
> >> checkpointing interval and the volume of data coming in from the source?
> >> With EOS data is committed after checkpoint, before that, the data is
> >> buffered in state, which makes the sink more resource intensive.
> >>
> >>   Jan
> >>
> >> On 6/18/24 05:30, Ruben Vargas wrote:
> >> > Attached a better image of the console.
> >> >
> >> > Thanks!
> >> >
> >> > On Mon, Jun 17, 2024 at 9:28 PM Ruben Vargas  
> >> > wrote:
> >> >> Hello guys
> >> >>
> >> >> Wondering if some of you have experiences enabling Exactly Once in
> >> >> KafkaIO with Flink runner? I enabled it and now I'm facing an issue
> >> >> where all the checkpoints are failing. I cannot see any exception on
> >> >> the logs.
> >> >>
> >> >> Flink console only mentions this "Asynchronous task checkpoint
> >> >> failed." I also noticed that some operators don't acknowledge the
> >> >> checkpointing  (Attached a screenshot).
> >> >>
> >> >> I did this:
> >> >>
> >> >> 1) KafkaIO.Read:
> >> >>
> >> >> update consumer properties with enable.auto.commit = false
> >> >> .withReadCommitted()
> >> >> .commitOffsetsInFinalize()
> >> >>
> >> >> 2) KafkaIO#write:
> >> >>
> >> >> .withEOS(numShards, sinkGroupId)
> >> >>
> >> >> But my application is not able to deliver messages to the output topic
> >> >> due the checkpoint failing.
> >> >> I also reviewed the timeout and other time sensitive parameters, those
> >> >> are high right now.
> >> >>
> >> >> I really appreciate your guidance on this. Thank you


Re: (Python) How to test pipeline that raises an exception?

2024-06-18 Thread Valentyn Tymofieiev via user
Hi Jaehyeon,

for exceptions that happen at pipeline construction, it should be possible
to use unittest.assertRaises.

For errors at runtime, it should be possible to detect that the pipeline
has finished in a `FAILED` state. You can capture the state by running the
pipeline without the `with` block, capturing pipeline_result =
p.run().wait_until_finish(), and analyzing it; or by trying
out PipelineStateMatcher. From a quick look, we have some tests that use
it:
https://github.com/apache/beam/blob/3ed91c880f85d09a45039e70d5136f1c2324916d/sdks/python/apache_beam/examples/complete/game/hourly_team_score_it_test.py#L68
,
https://github.com/apache/beam/blob/3ed91c880f85d09a45039e70d5136f1c2324916d/sdks/python/apache_beam/examples/complete/game/hourly_team_score_it_test.py#L80
, and it should be possible to create a matcher for a failed state.
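
As an illustration only, here is a minimal sketch of both patterns against the
DirectRunner (the failing transforms and test names below are made up for this
example; behavior on other runners may differ):

import unittest

import apache_beam as beam
from apache_beam.runners.runner import PipelineState


class FailingPipelineTest(unittest.TestCase):
    def test_construction_time_error(self):
        # Errors raised while the pipeline graph is being built surface as
        # ordinary Python exceptions, so plain assertRaises is enough.
        with self.assertRaises(TypeError):
            p = beam.Pipeline()
            _ = p | beam.Create(["a", "b"]) | beam.Map()  # missing fn argument

    def test_runtime_failure(self):
        # Run without the `with` block so the failure is not re-raised when
        # the context manager exits and the terminal state can be inspected.
        p = beam.Pipeline()
        _ = p | beam.Create(["a", "b"]) | beam.Map(lambda e: e["missing_key"])
        try:
            state = p.run().wait_until_finish()
        except Exception:
            # The DirectRunner re-raises user exceptions instead of returning
            # a FAILED result; that also counts as a failed pipeline here.
            return
        self.assertEqual(state, PipelineState.FAILED)


if __name__ == "__main__":
    unittest.main()

Running without the `with` block matters because the context manager calls
run().wait_until_finish() on exit and re-raises any failure, which hides the
terminal state from the test.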


On Tue, Jun 18, 2024 at 12:22 PM Jaehyeon Kim  wrote:

> Hello,
>
> I have a unit testing case and the pipeline raises TypeError. How is it
> possible to test? (I don't find unittest assertRaises method in the beam
> testing util)
>
> Cheers,
> Jaehyeon
>
> def test_write_to_firehose_with_unsupported_types(self):
> with TestPipeline(options=self.pipeline_opts) as p:
> output = (
> p
> | beam.Create(["one", "two", "three", "four"])
> | "WriteToFirehose" >> WriteToFirehose(self.
> delivery_stream_name, True)
> | "CollectResponse" >> beam.Map(lambda e: e[
> "FailedPutCount"])
> )
>


Re: Exactly once KafkaIO with flink runner

2024-06-18 Thread Ruben Vargas
Hello Jan

Thanks for the suggestions

Any benefit of using aligned vs unaligned?


In the end I found one problem that was preventing Flink from doing
the checkpointing. It was a DoFn that had some non-serializable
objects, so I made them transient and initialized them in setup().

Weird, because I was usually able to detect these kinds of errors just by
running in the direct runner, or even in Flink before enabling EOS.


Now I'm facing another weird issue

org.apache.beam.sdk.util.UserCodeException:
org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms
expired before the last committed offset for partitions
[behavioral-signals-6] could be determined. Try tuning
default.api.timeout.ms larger to relax the threshold.
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
at 
org.apache.beam.sdk.io.kafka.KafkaExactlyOnceSink$ExactlyOnceWriter$DoFnInvoker.invokeProcessElement(Unknown
Source)
at 
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:212)

I tried to extend the timeout and it didn't work, my shards are equal
to my number of partitions.

I appreciate any kind of guidance

Thanks.

On Tue, Jun 18, 2024 at 5:56 AM Jan Lukavský  wrote:
>
> I'd suggest:
>  a) use unaligned checkpoints, if possible
>
>  b) verify the number of buckets you use for EOS sink, this limits 
> parallelism [1].
>
> Best,
>
>  Jan
>
> [1] 
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.WriteRecords.html#withEOS-int-java.lang.String-
>
> On 6/18/24 09:32, Ruben Vargas wrote:
>
> Hello Lukavsky
>
> Thanks for your reply !
>
> I thought was due backpreassure but i increased the resources of the cluster 
> and problem still presist. More that that, data stop flowing and the 
> checkpoint still fail.
>
> I have configured the checkpoint to do it per minute. The timeout is 1h. Is 
> aligned checkpoint.
>
> El El mar, 18 de jun de 2024 a la(s) 1:14 a.m., Jan Lukavský 
>  escribió:
>>
>> H Ruben,
>>
>> from the provided screenshot it seems to me, that the pipeline in
>> backpressured by the sink. Can you please share your checkpoint
>> configuration? Are you using unaligned checkpoints? What is the
>> checkpointing interval and the volume of data coming in from the source?
>> With EOS data is committed after checkpoint, before that, the data is
>> buffered in state, which makes the sink more resource intensive.
>>
>>   Jan
>>
>> On 6/18/24 05:30, Ruben Vargas wrote:
>> > Attached a better image of the console.
>> >
>> > Thanks!
>> >
>> > On Mon, Jun 17, 2024 at 9:28 PM Ruben Vargas  
>> > wrote:
>> >> Hello guys
>> >>
>> >> Wondering if some of you have experiences enabling Exactly Once in
>> >> KafkaIO with Flink runner? I enabled it and now I'm facing an issue
>> >> where all the checkpoints are failing. I cannot see any exception on
>> >> the logs.
>> >>
>> >> Flink console only mentions this "Asynchronous task checkpoint
>> >> failed." I also noticed that some operators don't acknowledge the
>> >> checkpointing  (Attached a screenshot).
>> >>
>> >> I did this:
>> >>
>> >> 1) KafkaIO.Read:
>> >>
>> >> update consumer properties with enable.auto.commit = false
>> >> .withReadCommitted()
>> >> .commitOffsetsInFinalize()
>> >>
>> >> 2) KafkaIO#write:
>> >>
>> >> .withEOS(numShards, sinkGroupId)
>> >>
>> >> But my application is not able to deliver messages to the output topic
>> >> due the checkpoint failing.
>> >> I also reviewed the timeout and other time sensitive parameters, those
>> >> are high right now.
>> >>
>> >> I really appreciate your guidance on this. Thank you


Re: Exactly once KafkaIO with flink runner

2024-06-18 Thread Jan Lukavský

I'd suggest:
 a) use unaligned checkpoints, if possible

 b) verify the number of buckets you use for the EOS sink, as this limits 
parallelism [1].


Best,

 Jan

[1] 
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.WriteRecords.html#withEOS-int-java.lang.String-


On 6/18/24 09:32, Ruben Vargas wrote:

Hello Lukavsky

Thanks for your reply !

I thought was due backpreassure but i increased the resources of the 
cluster and problem still presist. More that that, data stop flowing 
and the checkpoint still fail.


I have configured the checkpoint to do it per minute. The timeout is 
1h. Is aligned checkpoint.


El El mar, 18 de jun de 2024 a la(s) 1:14 a.m., Jan Lukavský 
 escribió:


H Ruben,

from the provided screenshot it seems to me, that the pipeline in
backpressured by the sink. Can you please share your checkpoint
configuration? Are you using unaligned checkpoints? What is the
checkpointing interval and the volume of data coming in from the
source?
With EOS data is committed after checkpoint, before that, the data is
buffered in state, which makes the sink more resource intensive.

  Jan

On 6/18/24 05:30, Ruben Vargas wrote:
> Attached a better image of the console.
>
> Thanks!
>
> On Mon, Jun 17, 2024 at 9:28 PM Ruben Vargas
 wrote:
>> Hello guys
>>
>> Wondering if some of you have experiences enabling Exactly Once in
>> KafkaIO with Flink runner? I enabled it and now I'm facing an issue
>> where all the checkpoints are failing. I cannot see any
exception on
>> the logs.
>>
>> Flink console only mentions this "Asynchronous task checkpoint
>> failed." I also noticed that some operators don't acknowledge the
>> checkpointing  (Attached a screenshot).
>>
>> I did this:
>>
>> 1) KafkaIO.Read:
>>
>> update consumer properties with enable.auto.commit = false
>> .withReadCommitted()
>> .commitOffsetsInFinalize()
>>
>> 2) KafkaIO#write:
>>
>> .withEOS(numShards, sinkGroupId)
>>
>> But my application is not able to deliver messages to the
output topic
>> due the checkpoint failing.
>> I also reviewed the timeout and other time sensitive
parameters, those
>> are high right now.
>>
>> I really appreciate your guidance on this. Thank you


Re: Exactly once KafkaIO with flink runner

2024-06-18 Thread Ruben Vargas
Hello Lukavsky

Thanks for your reply !

I thought it was due to backpressure, but I increased the resources of the
cluster and the problem still persists. More than that, data stops flowing and
the checkpoint still fails.

I have configured checkpointing to run every minute. The timeout is 1h. It is
an aligned checkpoint.

El El mar, 18 de jun de 2024 a la(s) 1:14 a.m., Jan Lukavský <
je...@seznam.cz> escribió:

> H Ruben,
>
> from the provided screenshot it seems to me, that the pipeline in
> backpressured by the sink. Can you please share your checkpoint
> configuration? Are you using unaligned checkpoints? What is the
> checkpointing interval and the volume of data coming in from the source?
> With EOS data is committed after checkpoint, before that, the data is
> buffered in state, which makes the sink more resource intensive.
>
>   Jan
>
> On 6/18/24 05:30, Ruben Vargas wrote:
> > Attached a better image of the console.
> >
> > Thanks!
> >
> > On Mon, Jun 17, 2024 at 9:28 PM Ruben Vargas 
> wrote:
> >> Hello guys
> >>
> >> Wondering if some of you have experiences enabling Exactly Once in
> >> KafkaIO with Flink runner? I enabled it and now I'm facing an issue
> >> where all the checkpoints are failing. I cannot see any exception on
> >> the logs.
> >>
> >> Flink console only mentions this "Asynchronous task checkpoint
> >> failed." I also noticed that some operators don't acknowledge the
> >> checkpointing  (Attached a screenshot).
> >>
> >> I did this:
> >>
> >> 1) KafkaIO.Read:
> >>
> >> update consumer properties with enable.auto.commit = false
> >> .withReadCommitted()
> >> .commitOffsetsInFinalize()
> >>
> >> 2) KafkaIO#write:
> >>
> >> .withEOS(numShards, sinkGroupId)
> >>
> >> But my application is not able to deliver messages to the output topic
> >> due the checkpoint failing.
> >> I also reviewed the timeout and other time sensitive parameters, those
> >> are high right now.
> >>
> >> I really appreciate your guidance on this. Thank you
>


Re: Exactly once KafkaIO with flink runner

2024-06-18 Thread Jan Lukavský

Hi Ruben,

from the provided screenshot it seems to me that the pipeline is 
backpressured by the sink. Can you please share your checkpoint 
configuration? Are you using unaligned checkpoints? What is the 
checkpointing interval and the volume of data coming in from the source? 
With EOS data is committed after checkpoint, before that, the data is 
buffered in state, which makes the sink more resource intensive.


 Jan

On 6/18/24 05:30, Ruben Vargas wrote:

Attached a better image of the console.

Thanks!

On Mon, Jun 17, 2024 at 9:28 PM Ruben Vargas  wrote:

Hello guys

Wondering if some of you have experiences enabling Exactly Once in
KafkaIO with Flink runner? I enabled it and now I'm facing an issue
where all the checkpoints are failing. I cannot see any exception on
the logs.

Flink console only mentions this "Asynchronous task checkpoint
failed." I also noticed that some operators don't acknowledge the
checkpointing  (Attached a screenshot).

I did this:

1) KafkaIO.Read:

update consumer properties with enable.auto.commit = false
.withReadCommitted()
.commitOffsetsInFinalize()

2) KafkaIO#write:

.withEOS(numShards, sinkGroupId)

But my application is not able to deliver messages to the output topic
due the checkpoint failing.
I also reviewed the timeout and other time sensitive parameters, those
are high right now.

I really appreciate your guidance on this. Thank you


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-16 Thread Sofia’s World
Thanks. It appears that I did not read the documentation fully, and I missed
this in my dataflow flex-template run:

, '--parameters'
  , 'sdk_container_image=$_SDK_CONTAINER_IMAGE'

All my other jobs use a dodgy Dockerfile which does not require the
parameter above...
I should be fine for the time being; at least my pipeline is no longer plagued
by import errors.
Thanks all for helping out!

kind regards
Marco




On Sun, Jun 16, 2024 at 6:27 PM Utkarsh Parekh 
wrote:

> You have “mypackage” incorrectly built. Please check and confirm that.
>
> Utkarsh
>
> On Sun, Jun 16, 2024 at 12:48 PM Sofia’s World 
> wrote:
>
>> Error is same...- see bottom -
>> i have tried to ssh in the container and the directory is setup as
>> expected.. so not quite sure where the issue is
>> i will try to start from the pipeline with dependencies sample and work
>> out from there  w.o bothering the list
>>
>> thanks again for following up
>>  Marco
>>
>> Could not load main session. Inspect which external dependencies are used
>> in the main module of your pipeline. Verify that corresponding packages are
>> installed in the pipeline runtime environment and their installed versions
>> match the versions used in pipeline submission environment. For more
>> information, see:
>> https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
>> Traceback (most recent call last): File
>> "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
>> line 115, in create_harness _load_main_session(semi_persistent_directory)
>> File
>> "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
>> line 354, in _load_main_session pickler.load_session(session_file) File
>> "/usr/local/lib/python3.11/site-packages/apache_beam/internal/pickler.py",
>> line 65, in load_session return desired_pickle_lib.load_session(file_path)
>> ^^ File
>> "/usr/local/lib/python3.11/site-packages/apache_beam/internal/dill_pickler.py",
>> line 446, in load_session return dill.load_session(file_path)
>>  File
>> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 368, in
>> load_session module = unpickler.load()  File
>> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 472, in load
>> obj = StockUnpickler.load(self) ^ File
>> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 827, in
>> _import_module return getattr(__import__(module, None, None, [obj]), obj)
>> ^ ModuleNotFoundError: No module named
>> 'mypackage'
>>
>>
>>
>> On Sun, 16 Jun 2024, 14:50 XQ Hu via user,  wrote:
>>
>>> What is the error message now?
>>> You can easily ssh to your docker container and check everything is
>>> installed correctly by:
>>> docker run --rm -it --entrypoint=/bin/bash $CUSTOM_CONTAINER_IMAGE
>>>
>>>
>>> On Sun, Jun 16, 2024 at 5:18 AM Sofia’s World 
>>> wrote:
>>>
 Valentin, many thanks... i actually spotted the reference in teh setup
 file
 However , after correcting it, i am still at square 1 where somehow my
 runtime environment does not see it.. so i added some debugging to my
 Dockerfile to check if i forgot to copy something,
 and here's the output, where i can see the mypackage has been copied

 here's my directory structure

  mypackage
 __init__.py
 obbutils.py
 launcher.py
 __init__.py
 dataflow_tester.py
 setup_dftester.py (copied to setup.py)

 i can see the directory structure has been maintained when i copy my
 files to docker as i added some debug to my dockerfile

 Step #0 - "dftester-image": Removing intermediate container 4c4e763289d2
 Step #0 - "dftester-image":  ---> cda378f70a9e
 Step #0 - "dftester-image": Step 6/23 : COPY requirements.txt .
 Step #0 - "dftester-image":  ---> 9a43da08b013
 Step #0 - "dftester-image": Step 7/23 : COPY setup_dftester.py setup.py
 Step #0 - "dftester-image":  ---> 5a6bf71df052
 Step #0 - "dftester-image": Step 8/23 : COPY dataflow_tester.py .
 Step #0 - "dftester-image":  ---> 82cfe1f1f9ed
 Step #0 - "dftester-image": Step 9/23 : COPY mypackage mypackage
 Step #0 - "dftester-image":  ---> d86497b791d0
 Step #0 - "dftester-image": Step 10/23 : COPY __init__.py
 ${WORKDIR}/__init__.py
 Step #0 - "dftester-image":  ---> 337d149d64c7
 Step #0 - "dftester-image": Step 11/23 : RUN echo '- listing
 workdir'
 Step #0 - "dftester-image":  ---> Running in 9d97d8a64319
 Step #0 - "dftester-image": - listing workdir
 Step #0 - "dftester-image": Removing intermediate container 9d97d8a64319
 Step #0 - "dftester-image":  ---> bc9a6a2aa462
 Step #0 - "dftester-image": Step 12/23 : RUN ls -la ${WORKDIR}
 Step #0 - "dftester-image":  ---> Running in cf164108f9d6
 Step #0 - "dftester-image": total 24
>>>

Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-16 Thread Utkarsh Parekh
You have “mypackage” incorrectly built. Please check and confirm that.

Utkarsh

On Sun, Jun 16, 2024 at 12:48 PM Sofia’s World  wrote:

> Error is same...- see bottom -
> i have tried to ssh in the container and the directory is setup as
> expected.. so not quite sure where the issue is
> i will try to start from the pipeline with dependencies sample and work
> out from there  w.o bothering the list
>
> thanks again for following up
>  Marco
>
> Could not load main session. Inspect which external dependencies are used
> in the main module of your pipeline. Verify that corresponding packages are
> installed in the pipeline runtime environment and their installed versions
> match the versions used in pipeline submission environment. For more
> information, see:
> https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
> Traceback (most recent call last): File
> "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 115, in create_harness _load_main_session(semi_persistent_directory)
> File
> "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 354, in _load_main_session pickler.load_session(session_file) File
> "/usr/local/lib/python3.11/site-packages/apache_beam/internal/pickler.py",
> line 65, in load_session return desired_pickle_lib.load_session(file_path)
> ^^ File
> "/usr/local/lib/python3.11/site-packages/apache_beam/internal/dill_pickler.py",
> line 446, in load_session return dill.load_session(file_path)
>  File
> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 368, in
> load_session module = unpickler.load()  File
> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 472, in load
> obj = StockUnpickler.load(self) ^ File
> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 827, in
> _import_module return getattr(__import__(module, None, None, [obj]), obj)
> ^ ModuleNotFoundError: No module named
> 'mypackage'
>
>
>
> On Sun, 16 Jun 2024, 14:50 XQ Hu via user,  wrote:
>
>> What is the error message now?
>> You can easily ssh to your docker container and check everything is
>> installed correctly by:
>> docker run --rm -it --entrypoint=/bin/bash $CUSTOM_CONTAINER_IMAGE
>>
>>
>> On Sun, Jun 16, 2024 at 5:18 AM Sofia’s World 
>> wrote:
>>
>>> Valentin, many thanks... i actually spotted the reference in teh setup
>>> file
>>> However , after correcting it, i am still at square 1 where somehow my
>>> runtime environment does not see it.. so i added some debugging to my
>>> Dockerfile to check if i forgot to copy something,
>>> and here's the output, where i can see the mypackage has been copied
>>>
>>> here's my directory structure
>>>
>>>  mypackage
>>> __init__.py
>>> obbutils.py
>>> launcher.py
>>> __init__.py
>>> dataflow_tester.py
>>> setup_dftester.py (copied to setup.py)
>>>
>>> i can see the directory structure has been maintained when i copy my
>>> files to docker as i added some debug to my dockerfile
>>>
>>> Step #0 - "dftester-image": Removing intermediate container 4c4e763289d2
>>> Step #0 - "dftester-image":  ---> cda378f70a9e
>>> Step #0 - "dftester-image": Step 6/23 : COPY requirements.txt .
>>> Step #0 - "dftester-image":  ---> 9a43da08b013
>>> Step #0 - "dftester-image": Step 7/23 : COPY setup_dftester.py setup.py
>>> Step #0 - "dftester-image":  ---> 5a6bf71df052
>>> Step #0 - "dftester-image": Step 8/23 : COPY dataflow_tester.py .
>>> Step #0 - "dftester-image":  ---> 82cfe1f1f9ed
>>> Step #0 - "dftester-image": Step 9/23 : COPY mypackage mypackage
>>> Step #0 - "dftester-image":  ---> d86497b791d0
>>> Step #0 - "dftester-image": Step 10/23 : COPY __init__.py
>>> ${WORKDIR}/__init__.py
>>> Step #0 - "dftester-image":  ---> 337d149d64c7
>>> Step #0 - "dftester-image": Step 11/23 : RUN echo '- listing workdir'
>>> Step #0 - "dftester-image":  ---> Running in 9d97d8a64319
>>> Step #0 - "dftester-image": - listing workdir
>>> Step #0 - "dftester-image": Removing intermediate container 9d97d8a64319
>>> Step #0 - "dftester-image":  ---> bc9a6a2aa462
>>> Step #0 - "dftester-image": Step 12/23 : RUN ls -la ${WORKDIR}
>>> Step #0 - "dftester-image":  ---> Running in cf164108f9d6
>>> Step #0 - "dftester-image": total 24
>>> Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 .
>>> Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 ..
>>> Step #0 - "dftester-image": -rw-r--r-- 1 root root0 Jun 16 08:57
>>> __init__.py
>>> Step #0 - "dftester-image": -rw-r--r-- 1 root root  135 Jun 16 08:57
>>> dataflow_tester.py
>>> Step #0 - "dftester-image": drwxr-xr-x 2 root root 4096 Jun 16 08:59
>>> mypackage
>>> Step #0 - "dftester-image": -rw-r--r-- 1 root root   64 Jun 16 08:57
>>> requirements.txt
>>> Step #0 - "dftester-image": -rw-r--r-- 1 root root  736 Jun 16 08:57
>>> setup.py

Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-16 Thread Sofia’s World
The error is the same... see bottom.
I have tried to ssh into the container and the directory is set up as
expected, so I am not quite sure where the issue is.
I will try to start from the pipeline-with-dependencies sample and work out
from there without bothering the list.

Thanks again for following up,
 Marco

Could not load main session. Inspect which external dependencies are used
in the main module of your pipeline. Verify that corresponding packages are
installed in the pipeline runtime environment and their installed versions
match the versions used in pipeline submission environment. For more
information, see:
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
Traceback (most recent call last): File
"/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 115, in create_harness _load_main_session(semi_persistent_directory)
File
"/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 354, in _load_main_session pickler.load_session(session_file) File
"/usr/local/lib/python3.11/site-packages/apache_beam/internal/pickler.py",
line 65, in load_session return desired_pickle_lib.load_session(file_path)
^^ File
"/usr/local/lib/python3.11/site-packages/apache_beam/internal/dill_pickler.py",
line 446, in load_session return dill.load_session(file_path)
 File
"/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 368, in
load_session module = unpickler.load()  File
"/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self) ^ File
"/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 827, in
_import_module return getattr(__import__(module, None, None, [obj]), obj)
^ ModuleNotFoundError: No module named
'mypackage'



On Sun, 16 Jun 2024, 14:50 XQ Hu via user,  wrote:

> What is the error message now?
> You can easily ssh to your docker container and check everything is
> installed correctly by:
> docker run --rm -it --entrypoint=/bin/bash $CUSTOM_CONTAINER_IMAGE
>
>
> On Sun, Jun 16, 2024 at 5:18 AM Sofia’s World  wrote:
>
>> Valentin, many thanks... i actually spotted the reference in teh setup
>> file
>> However , after correcting it, i am still at square 1 where somehow my
>> runtime environment does not see it.. so i added some debugging to my
>> Dockerfile to check if i forgot to copy something,
>> and here's the output, where i can see the mypackage has been copied
>>
>> here's my directory structure
>>
>>  mypackage
>> __init__.py
>> obbutils.py
>> launcher.py
>> __init__.py
>> dataflow_tester.py
>> setup_dftester.py (copied to setup.py)
>>
>> i can see the directory structure has been maintained when i copy my
>> files to docker as i added some debug to my dockerfile
>>
>> Step #0 - "dftester-image": Removing intermediate container 4c4e763289d2
>> Step #0 - "dftester-image":  ---> cda378f70a9e
>> Step #0 - "dftester-image": Step 6/23 : COPY requirements.txt .
>> Step #0 - "dftester-image":  ---> 9a43da08b013
>> Step #0 - "dftester-image": Step 7/23 : COPY setup_dftester.py setup.py
>> Step #0 - "dftester-image":  ---> 5a6bf71df052
>> Step #0 - "dftester-image": Step 8/23 : COPY dataflow_tester.py .
>> Step #0 - "dftester-image":  ---> 82cfe1f1f9ed
>> Step #0 - "dftester-image": Step 9/23 : COPY mypackage mypackage
>> Step #0 - "dftester-image":  ---> d86497b791d0
>> Step #0 - "dftester-image": Step 10/23 : COPY __init__.py
>> ${WORKDIR}/__init__.py
>> Step #0 - "dftester-image":  ---> 337d149d64c7
>> Step #0 - "dftester-image": Step 11/23 : RUN echo '- listing workdir'
>> Step #0 - "dftester-image":  ---> Running in 9d97d8a64319
>> Step #0 - "dftester-image": - listing workdir
>> Step #0 - "dftester-image": Removing intermediate container 9d97d8a64319
>> Step #0 - "dftester-image":  ---> bc9a6a2aa462
>> Step #0 - "dftester-image": Step 12/23 : RUN ls -la ${WORKDIR}
>> Step #0 - "dftester-image":  ---> Running in cf164108f9d6
>> Step #0 - "dftester-image": total 24
>> Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 .
>> Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 ..
>> Step #0 - "dftester-image": -rw-r--r-- 1 root root0 Jun 16 08:57
>> __init__.py
>> Step #0 - "dftester-image": -rw-r--r-- 1 root root  135 Jun 16 08:57
>> dataflow_tester.py
>> Step #0 - "dftester-image": drwxr-xr-x 2 root root 4096 Jun 16 08:59
>> mypackage
>> Step #0 - "dftester-image": -rw-r--r-- 1 root root   64 Jun 16 08:57
>> requirements.txt
>> Step #0 - "dftester-image": -rw-r--r-- 1 root root  736 Jun 16 08:57
>> setup.py
>> Step #0 - "dftester-image": Removing intermediate container cf164108f9d6
>> Step #0 - "dftester-image":  ---> eb1a080b7948
>> Step #0 - "dftester-image": Step 13/23 : RUN echo '--- listing modules
>> -'
>> Step #0 - "dftester-image":  ---> Running in 884f03dd81d6

Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-16 Thread XQ Hu via user
What is the error message now?
You can easily ssh to your docker container and check everything is
installed correctly by:
docker run --rm -it --entrypoint=/bin/bash $CUSTOM_CONTAINER_IMAGE


On Sun, Jun 16, 2024 at 5:18 AM Sofia’s World  wrote:

> Valentin, many thanks... i actually spotted the reference in teh setup file
> However , after correcting it, i am still at square 1 where somehow my
> runtime environment does not see it.. so i added some debugging to my
> Dockerfile to check if i forgot to copy something,
> and here's the output, where i can see the mypackage has been copied
>
> here's my directory structure
>
>  mypackage
> __init__.py
> obbutils.py
> launcher.py
> __init__.py
> dataflow_tester.py
> setup_dftester.py (copied to setup.py)
>
> i can see the directory structure has been maintained when i copy my files
> to docker as i added some debug to my dockerfile
>
> Step #0 - "dftester-image": Removing intermediate container 4c4e763289d2
> Step #0 - "dftester-image":  ---> cda378f70a9e
> Step #0 - "dftester-image": Step 6/23 : COPY requirements.txt .
> Step #0 - "dftester-image":  ---> 9a43da08b013
> Step #0 - "dftester-image": Step 7/23 : COPY setup_dftester.py setup.py
> Step #0 - "dftester-image":  ---> 5a6bf71df052
> Step #0 - "dftester-image": Step 8/23 : COPY dataflow_tester.py .
> Step #0 - "dftester-image":  ---> 82cfe1f1f9ed
> Step #0 - "dftester-image": Step 9/23 : COPY mypackage mypackage
> Step #0 - "dftester-image":  ---> d86497b791d0
> Step #0 - "dftester-image": Step 10/23 : COPY __init__.py
> ${WORKDIR}/__init__.py
> Step #0 - "dftester-image":  ---> 337d149d64c7
> Step #0 - "dftester-image": Step 11/23 : RUN echo '- listing workdir'
> Step #0 - "dftester-image":  ---> Running in 9d97d8a64319
> Step #0 - "dftester-image": - listing workdir
> Step #0 - "dftester-image": Removing intermediate container 9d97d8a64319
> Step #0 - "dftester-image":  ---> bc9a6a2aa462
> Step #0 - "dftester-image": Step 12/23 : RUN ls -la ${WORKDIR}
> Step #0 - "dftester-image":  ---> Running in cf164108f9d6
> Step #0 - "dftester-image": total 24
> Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 .
> Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 ..
> Step #0 - "dftester-image": -rw-r--r-- 1 root root0 Jun 16 08:57
> __init__.py
> Step #0 - "dftester-image": -rw-r--r-- 1 root root  135 Jun 16 08:57
> dataflow_tester.py
> Step #0 - "dftester-image": drwxr-xr-x 2 root root 4096 Jun 16 08:59
> mypackage
> Step #0 - "dftester-image": -rw-r--r-- 1 root root   64 Jun 16 08:57
> requirements.txt
> Step #0 - "dftester-image": -rw-r--r-- 1 root root  736 Jun 16 08:57
> setup.py
> Step #0 - "dftester-image": Removing intermediate container cf164108f9d6
> Step #0 - "dftester-image":  ---> eb1a080b7948
> Step #0 - "dftester-image": Step 13/23 : RUN echo '--- listing modules
> -'
> Step #0 - "dftester-image":  ---> Running in 884f03dd81d6
> Step #0 - "dftester-image": --- listing modules -
> Step #0 - "dftester-image": Removing intermediate container 884f03dd81d6
> Step #0 - "dftester-image":  ---> 9f6f7e27bd2f
> Step #0 - "dftester-image": Step 14/23 : RUN ls -la  ${WORKDIR}/mypackage
> Step #0 - "dftester-image":  ---> Running in bd74ade37010
> Step #0 - "dftester-image": total 16
> Step #0 - "dftester-image": drwxr-xr-x 2 root root 4096 Jun 16 08:59 .
> Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 ..
> Step #0 - "dftester-image": -rw-r--r-- 1 root root0 Jun 16 08:57
> __init__.py
> Step #0 - "dftester-image": -rw-r--r-- 1 root root 1442 Jun 16 08:57
> launcher.py
> Step #0 - "dftester-image": -rw-r--r-- 1 root root  607 Jun 16 08:57
> obb_utils.py
> Step #0 - "dftester-image": Removing intermediate container bd74ade37010
>
>
> i have this in my setup.py
>
> REQUIRED_PACKAGES = [
> 'openbb',
> "apache-beam[gcp]",  # Must match the version in `Dockerfile``.
> 'sendgrid',
> 'pandas_datareader',
> 'vaderSentiment',
> 'numpy',
> 'bs4',
> 'lxml',
> 'pandas_datareader',
> 'beautifulsoup4',
> 'xlrd',
> 'openpyxl'
> ]
>
>
> setuptools.setup(
> name='mypackage',
> version='0.0.1',
> description='Shres Runner Package.',
> install_requires=REQUIRED_PACKAGES,
> packages=setuptools.find_packages()
> )
>
>
> and this is my dataflow_tester.py
>
> from mypackage import launcher
> import logging
> if __name__ == '__main__':
>   logging.getLogger().setLevel(logging.INFO)
>   launcher.run()
>
>
>
> have compared my setup vs
> https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
> and all looks the same (apart from my copying the __init__.,py fromo the
> directory where the main file(dataflow_tester.py) resides
>
> Would you know how else can i debug what is going on and why my
> mypackages subdirectory is not being seen?
>
> Kind regars
>  Marco
>
>
>
>
> On Sat, Jun 15, 2024 at 7:27 PM Valentyn 

Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-16 Thread Sofia’s World
Valentyn, many thanks... I actually spotted the reference in the setup file.
However, after correcting it, I am still at square one, where somehow my
runtime environment does not see it, so I added some debugging to my
Dockerfile to check whether I forgot to copy something,
and here's the output, where I can see that mypackage has been copied.

here's my directory structure

 mypackage
__init__.py
obbutils.py
launcher.py
__init__.py
dataflow_tester.py
setup_dftester.py (copied to setup.py)

i can see the directory structure has been maintained when i copy my files
to docker as i added some debug to my dockerfile

Step #0 - "dftester-image": Removing intermediate container 4c4e763289d2
Step #0 - "dftester-image":  ---> cda378f70a9e
Step #0 - "dftester-image": Step 6/23 : COPY requirements.txt .
Step #0 - "dftester-image":  ---> 9a43da08b013
Step #0 - "dftester-image": Step 7/23 : COPY setup_dftester.py setup.py
Step #0 - "dftester-image":  ---> 5a6bf71df052
Step #0 - "dftester-image": Step 8/23 : COPY dataflow_tester.py .
Step #0 - "dftester-image":  ---> 82cfe1f1f9ed
Step #0 - "dftester-image": Step 9/23 : COPY mypackage mypackage
Step #0 - "dftester-image":  ---> d86497b791d0
Step #0 - "dftester-image": Step 10/23 : COPY __init__.py
${WORKDIR}/__init__.py
Step #0 - "dftester-image":  ---> 337d149d64c7
Step #0 - "dftester-image": Step 11/23 : RUN echo '- listing workdir'
Step #0 - "dftester-image":  ---> Running in 9d97d8a64319
Step #0 - "dftester-image": - listing workdir
Step #0 - "dftester-image": Removing intermediate container 9d97d8a64319
Step #0 - "dftester-image":  ---> bc9a6a2aa462
Step #0 - "dftester-image": Step 12/23 : RUN ls -la ${WORKDIR}
Step #0 - "dftester-image":  ---> Running in cf164108f9d6
Step #0 - "dftester-image": total 24
Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 .
Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 ..
Step #0 - "dftester-image": -rw-r--r-- 1 root root0 Jun 16 08:57
__init__.py
Step #0 - "dftester-image": -rw-r--r-- 1 root root  135 Jun 16 08:57
dataflow_tester.py
Step #0 - "dftester-image": drwxr-xr-x 2 root root 4096 Jun 16 08:59
mypackage
Step #0 - "dftester-image": -rw-r--r-- 1 root root   64 Jun 16 08:57
requirements.txt
Step #0 - "dftester-image": -rw-r--r-- 1 root root  736 Jun 16 08:57
setup.py
Step #0 - "dftester-image": Removing intermediate container cf164108f9d6
Step #0 - "dftester-image":  ---> eb1a080b7948
Step #0 - "dftester-image": Step 13/23 : RUN echo '--- listing modules
-'
Step #0 - "dftester-image":  ---> Running in 884f03dd81d6
Step #0 - "dftester-image": --- listing modules -
Step #0 - "dftester-image": Removing intermediate container 884f03dd81d6
Step #0 - "dftester-image":  ---> 9f6f7e27bd2f
Step #0 - "dftester-image": Step 14/23 : RUN ls -la  ${WORKDIR}/mypackage
Step #0 - "dftester-image":  ---> Running in bd74ade37010
Step #0 - "dftester-image": total 16
Step #0 - "dftester-image": drwxr-xr-x 2 root root 4096 Jun 16 08:59 .
Step #0 - "dftester-image": drwxr-xr-x 1 root root 4096 Jun 16 08:59 ..
Step #0 - "dftester-image": -rw-r--r-- 1 root root0 Jun 16 08:57
__init__.py
Step #0 - "dftester-image": -rw-r--r-- 1 root root 1442 Jun 16 08:57
launcher.py
Step #0 - "dftester-image": -rw-r--r-- 1 root root  607 Jun 16 08:57
obb_utils.py
Step #0 - "dftester-image": Removing intermediate container bd74ade37010


i have this in my setup.py

REQUIRED_PACKAGES = [
'openbb',
"apache-beam[gcp]",  # Must match the version in `Dockerfile``.
'sendgrid',
'pandas_datareader',
'vaderSentiment',
'numpy',
'bs4',
'lxml',
'pandas_datareader',
'beautifulsoup4',
'xlrd',
'openpyxl'
]


setuptools.setup(
name='mypackage',
version='0.0.1',
description='Shres Runner Package.',
install_requires=REQUIRED_PACKAGES,
packages=setuptools.find_packages()
)


and this is my dataflow_tester.py

from mypackage import launcher
import logging
if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  launcher.run()



have compared my setup vs
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
and all looks the same (apart from my copying the __init__.,py fromo the
directory where the main file(dataflow_tester.py) resides

Would you know how else I can debug what is going on and why my mypackage
subdirectory is not being seen?

Kind regards,
 Marco




On Sat, Jun 15, 2024 at 7:27 PM Valentyn Tymofieiev via user <
user@beam.apache.org> wrote:

> Your pipeline launcher refers to a package named 'modules', but this
> package is not available in the runtime environment.
>
> On Sat, Jun 15, 2024 at 11:17 AM Sofia’s World 
> wrote:
>
>> Sorry, i cheered up too early
>> i can successfully build the image however, at runtime the code fails
>> always with this exception and i cannot figure out why
>>
>> i mimicked the sample directory structure
>>
>>
>>  mypackage

Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-15 Thread Valentyn Tymofieiev via user
Your pipeline launcher refers to a package named 'modules', but this
package is not available in the runtime environment.

On Sat, Jun 15, 2024 at 11:17 AM Sofia’s World  wrote:

> Sorry, i cheered up too early
> i can successfully build the image however, at runtime the code fails
> always with this exception and i cannot figure out why
>
> i mimicked the sample directory structure
>
>
>  mypackage
>--- __init__,py
>dftester.py
>obb_utils.py
>
> dataflow_tester_main.py
>
> this is the content of my dataflow_tester_main.py
>
> from mypackage import dftester
> import logging
> if __name__ == '__main__':
>   logging.getLogger().setLevel(logging.INFO)
>   dftester.run()
>
>
> and this is my dockerfile
>
>
> https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/shareloader/Dockerfile_tester
>
> and at the bottom if this email my exception
> I am puzzled on where the error is coming from as i have almost copied
> this sample
> https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/dataflow/flex-templates/pipeline_with_dependencies/main.py
>
> thanks and regards
>  Marco
>
>
>
>
>
>
>
>
>
>
>
> Traceback (most recent call last): File
> "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 115, in create_harness _load_main_session(semi_persistent_directory)
> File
> "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 354, in _load_main_session pickler.load_session(session_file) File
> "/usr/local/lib/python3.11/site-packages/apache_beam/internal/pickler.py",
> line 65, in load_session return desired_pickle_lib.load_session(file_path)
> ^^ File
> "/usr/local/lib/python3.11/site-packages/apache_beam/internal/dill_pickler.py",
> line 446, in load_session return dill.load_session(file_path)
>  File
> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 368, in
> load_session module = unpickler.load()  File
> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 472, in load
> obj = StockUnpickler.load(self) ^ File
> "/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 462, in
> find_class return StockUnpickler.find_class(self, module, name)
> ^ ModuleNotFoundError: No
> module named 'modules'
>
>
>
>
>
>
>
> On Fri, Jun 14, 2024 at 5:52 AM Sofia’s World  wrote:
>
>> Many thanks Hu, worked like a charm
>>
>> few qq
>> so in my reqs.txt i should put all beam requirements PLUS my own?
>>
>> and in the setup.py, shall i just declare
>>
>> "apache-beam[gcp]==2.54.0",  # Must match the version in `Dockerfile``.
>>
>> thanks and kind regards
>> Marco
>>
>>
>>
>>
>>
>>
>> On Wed, Jun 12, 2024 at 1:48 PM XQ Hu  wrote:
>>
>>> Any reason to use this?
>>>
>>> RUN pip install avro-python3 pyarrow==0.15.1 apache-beam[gcp]==2.30.0
>>>  pandas-datareader==0.9.0
>>>
>>> It is typically recommended to use the latest Beam and build the docker
>>> image using the requirements released for each Beam, for example,
>>> https://github.com/apache/beam/blob/release-2.56.0/sdks/python/container/py311/base_image_requirements.txt
>>>
>>> On Wed, Jun 12, 2024 at 1:31 AM Sofia’s World 
>>> wrote:
>>>
 Sure, apologies, it crossed my mind it would have been useful to refert
 to it

 so this is the docker file


 https://github.com/mmistroni/GCP_Experiments/edit/master/dataflow/shareloader/Dockerfile_tester

 I was using a setup.py as well, but then i commented out the usage in
 the dockerfile after checking some flex templates which said it is not
 needed


 https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/shareloader/setup_dftester.py

 thanks in advance
  Marco







 On Tue, Jun 11, 2024 at 10:54 PM XQ Hu  wrote:

> Can you share your Dockerfile?
>
> On Tue, Jun 11, 2024 at 4:43 PM Sofia’s World 
> wrote:
>
>> thanks all,  it seemed to work but now i am getting a different
>> problem, having issues in building pyarrow...
>>
>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":  
>>  :36: DeprecationWarning: pkg_resources is deprecated as an API. 
>> See https://setuptools.pypa.io/en/latest/pkg_resources.html
>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":  
>>  WARNING setuptools_scm.pyproject_reading toml section missing 
>> 'pyproject.toml does not contain a tool.setuptools_scm section'
>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":  
>>  Traceback (most recent call last):
>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":  
>>File 
>> "/tmp/pip-build-env-meihcxsp/overlay/lib/python3.11/site-packages/setuptools_scm/_integration/pyproject_re

Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-15 Thread Sofia’s World
Sorry, I cheered up too early.
I can successfully build the image; however, at runtime the code always fails
with this exception and I cannot figure out why.

I mimicked the sample directory structure:


 mypackage
   --- __init__,py
   dftester.py
   obb_utils.py

dataflow_tester_main.py

this is the content of my dataflow_tester_main.py

from mypackage import dftester
import logging
if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  dftester.run()


and this is my dockerfile

https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/shareloader/Dockerfile_tester

At the bottom of this email is my exception.
I am puzzled about where the error is coming from, as I have almost copied this
sample:
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/dataflow/flex-templates/pipeline_with_dependencies/main.py

thanks and regards
 Marco











Traceback (most recent call last): File
"/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 115, in create_harness _load_main_session(semi_persistent_directory)
File
"/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 354, in _load_main_session pickler.load_session(session_file) File
"/usr/local/lib/python3.11/site-packages/apache_beam/internal/pickler.py",
line 65, in load_session return desired_pickle_lib.load_session(file_path)
^^ File
"/usr/local/lib/python3.11/site-packages/apache_beam/internal/dill_pickler.py",
line 446, in load_session return dill.load_session(file_path)
 File
"/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 368, in
load_session module = unpickler.load()  File
"/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self) ^ File
"/usr/local/lib/python3.11/site-packages/dill/_dill.py", line 462, in
find_class return StockUnpickler.find_class(self, module, name)
^ ModuleNotFoundError: No
module named 'modules'







On Fri, Jun 14, 2024 at 5:52 AM Sofia’s World  wrote:

> Many thanks Hu, worked like a charm
>
> few qq
> so in my reqs.txt i should put all beam requirements PLUS my own?
>
> and in the setup.py, shall i just declare
>
> "apache-beam[gcp]==2.54.0",  # Must match the version in `Dockerfile``.
>
> thanks and kind regards
> Marco
>
>
>
>
>
>
> On Wed, Jun 12, 2024 at 1:48 PM XQ Hu  wrote:
>
>> Any reason to use this?
>>
>> RUN pip install avro-python3 pyarrow==0.15.1 apache-beam[gcp]==2.30.0
>>  pandas-datareader==0.9.0
>>
>> It is typically recommended to use the latest Beam and build the docker
>> image using the requirements released for each Beam, for example,
>> https://github.com/apache/beam/blob/release-2.56.0/sdks/python/container/py311/base_image_requirements.txt
>>
>> On Wed, Jun 12, 2024 at 1:31 AM Sofia’s World 
>> wrote:
>>
>>> Sure, apologies, it crossed my mind it would have been useful to refert
>>> to it
>>>
>>> so this is the docker file
>>>
>>>
>>> https://github.com/mmistroni/GCP_Experiments/edit/master/dataflow/shareloader/Dockerfile_tester
>>>
>>> I was using a setup.py as well, but then i commented out the usage in
>>> the dockerfile after checking some flex templates which said it is not
>>> needed
>>>
>>>
>>> https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/shareloader/setup_dftester.py
>>>
>>> thanks in advance
>>>  Marco
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 11, 2024 at 10:54 PM XQ Hu  wrote:
>>>
 Can you share your Dockerfile?

 On Tue, Jun 11, 2024 at 4:43 PM Sofia’s World 
 wrote:

> thanks all,  it seemed to work but now i am getting a different
> problem, having issues in building pyarrow...
>
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> :36: DeprecationWarning: pkg_resources is deprecated as an API. 
> See https://setuptools.pypa.io/en/latest/pkg_resources.html
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> WARNING setuptools_scm.pyproject_reading toml section missing 
> 'pyproject.toml does not contain a tool.setuptools_scm section'
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> Traceback (most recent call last):
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>   File 
> "/tmp/pip-build-env-meihcxsp/overlay/lib/python3.11/site-packages/setuptools_scm/_integration/pyproject_reading.py",
>  line 36, in read_pyproject
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> section = defn.get("tool", {})[tool_name]
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>   ^^^
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   KeyError: 'setuptools_scm'

Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-14 Thread Valentyn Tymofieiev via user
I recommend putting all top-level dependencies for your pipeline in the setup.py
install_requires section, and autogenerating the requirements.txt file, which
would then include all transitive dependencies and ensure reproducible
builds.

For approaches to generate the requirements.txt file from top level
requirements specified in the setup.py file, see:
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies#optional-update-the-dependencies-in-the-requirements-file-and-rebuild-the-docker-images
.
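For illustration, a minimal sketch of that layout (the package name, the pins, and the use of pip-tools are my own assumptions, not taken from this thread):

```python
# setup.py -- only the direct, top-level dependencies are listed here.
import setuptools

setuptools.setup(
    name="my_pipeline",  # placeholder package name
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=[
        "apache-beam[gcp]==2.56.0",  # keep in sync with the Beam version in the Dockerfile
        "pandas-datareader==0.9.0",
    ],
)
```

The pinned requirements.txt with all transitive dependencies can then be regenerated from it, for example with pip-tools:

```
pip install pip-tools
pip-compile --output-file=requirements.txt setup.py
```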

Valentyn

On Thu, Jun 13, 2024 at 9:52 PM Sofia’s World  wrote:

> Many thanks Hu, worked like a charm
>
> few qq
> so in my reqs.txt i should put all beam requirements PLUS my own?
>
> and in the setup.py, shall i just declare
>
> "apache-beam[gcp]==2.54.0",  # Must match the version in `Dockerfile``.
>
> thanks and kind regards
> Marco
>
>
>
>
>
>
> On Wed, Jun 12, 2024 at 1:48 PM XQ Hu  wrote:
>
>> Any reason to use this?
>>
>> RUN pip install avro-python3 pyarrow==0.15.1 apache-beam[gcp]==2.30.0
>>  pandas-datareader==0.9.0
>>
>> It is typically recommended to use the latest Beam and build the docker
>> image using the requirements released for each Beam, for example,
>> https://github.com/apache/beam/blob/release-2.56.0/sdks/python/container/py311/base_image_requirements.txt
>>
>> On Wed, Jun 12, 2024 at 1:31 AM Sofia’s World 
>> wrote:
>>
>>> Sure, apologies, it crossed my mind it would have been useful to refert
>>> to it
>>>
>>> so this is the docker file
>>>
>>>
>>> https://github.com/mmistroni/GCP_Experiments/edit/master/dataflow/shareloader/Dockerfile_tester
>>>
>>> I was using a setup.py as well, but then i commented out the usage in
>>> the dockerfile after checking some flex templates which said it is not
>>> needed
>>>
>>>
>>> https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/shareloader/setup_dftester.py
>>>
>>> thanks in advance
>>>  Marco
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 11, 2024 at 10:54 PM XQ Hu  wrote:
>>>
 Can you share your Dockerfile?

 On Tue, Jun 11, 2024 at 4:43 PM Sofia’s World 
 wrote:

> thanks all,  it seemed to work but now i am getting a different
> problem, having issues in building pyarrow...
>
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> :36: DeprecationWarning: pkg_resources is deprecated as an API. 
> See https://setuptools.pypa.io/en/latest/pkg_resources.html
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> WARNING setuptools_scm.pyproject_reading toml section missing 
> 'pyproject.toml does not contain a tool.setuptools_scm section'
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> Traceback (most recent call last):
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>   File 
> "/tmp/pip-build-env-meihcxsp/overlay/lib/python3.11/site-packages/setuptools_scm/_integration/pyproject_reading.py",
>  line 36, in read_pyproject
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> section = defn.get("tool", {})[tool_name]
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>   ^^^
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> KeyError: 'setuptools_scm'
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> running bdist_wheel
>
>
>
>
> It is somehow getting messed up with a toml ?
>
>
> Could anyone advise?
>
> thanks
>
>  Marco
>
>
>
>
>
> On Tue, Jun 11, 2024 at 1:00 AM XQ Hu via user 
> wrote:
>
>>
>> https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
>> is a great example.
>>
>> On Mon, Jun 10, 2024 at 4:28 PM Valentyn Tymofieiev via user <
>> user@beam.apache.org> wrote:
>>
>>> In this case the Python version will be defined by the Python
>>> version installed in the docker image of your flex template. So, you'd
>>> have to build your flex template from a base image with Python 3.11.
>>>
>>> On Mon, Jun 10, 2024 at 12:50 PM Sofia’s World 
>>> wrote:
>>>
 Hello
  no i am running my pipelien on  GCP directly via a flex template,
 configured using a Docker file
 Any chances to do something in the Dockerfile to force the version
 at runtime?
 Thanks

 On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user <
 user@beam.apache.org> wrote:

> Hello,
>
> Are you running your pipeline from the python 3.11 environment?
> If you are running from a python 3.11 environment and don't use a 
> custom
> docker c

Re: How windowing is implemented on Flink runner

2024-06-14 Thread Wiśniowski Piotr

Hi,

Wanted to follow up, as I had a similar case.

So this means it is OK for Beam to use a sliding window of 1 day with a 1 sec
period (using a trigger other than after-watermark to avoid outputting data
from every window), and there is no additional performance penalty
(duplicating input messages for storage, or CPU for resolving windows)?
Interesting from both the Flink and Dataflow perspective (both Python and
Java).


I ended up implementing the logic with Beam state and timers (which is quite
performant and readable), but I am also interested in other possibilities.
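For reference, a minimal sketch of that state-and-timer approach in the Python SDK (the names, the float values, and the one-day flush horizon are illustrative, not my actual production code):

```python
import apache_beam as beam
from apache_beam.coders import FloatCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer


class RollingDayFn(beam.DoFn):
    """Buffers values per key and emits an aggregate when the watermark timer fires."""

    BUFFER = BagStateSpec('buffer', FloatCoder())
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(
        self,
        element,
        timestamp=beam.DoFn.TimestampParam,
        buffer=beam.DoFn.StateParam(BUFFER),
        flush=beam.DoFn.TimerParam(FLUSH),
    ):
        key, value = element
        buffer.add(value)
        # (Re)arm a watermark timer; the real firing policy should match
        # whatever window semantics you are emulating.
        flush.set(timestamp + 24 * 60 * 60)

    @on_timer(FLUSH)
    def on_flush(
        self,
        key=beam.DoFn.KeyParam,
        buffer=beam.DoFn.StateParam(BUFFER),
    ):
        values = list(buffer.read())
        buffer.clear()
        yield key, sum(values)


# Usage: the input PCollection must be keyed, e.g.
#   keyed_values | beam.ParDo(RollingDayFn())
```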


Best

Wiśniowski Piotr

On 12.06.2024 21:50, Ruben Vargas wrote:

I imagined it but wasn't sure!

Thanks for the clarification!

On Wed, Jun 12, 2024 at 1:42 PM Robert Bradshaw via user
 wrote:

Beam implements Windowing itself (via state and timers) rather than
deferring to Flink's implementation.

On Wed, Jun 12, 2024 at 11:55 AM Ruben Vargas  wrote:

Hello guys

May be a silly question,

But in the Flink runner, the window implementation uses the Flink
windowing? Does that mean the runner will have performance issues like
Flink itself? see this:
https://issues.apache.org/jira/browse/FLINK-7001

I'm asking because I see the issue, it mentions different concepts
that Beam already handles at the API level. So my suspicion is that
the Beam model handles windowing a little differently from the pure
Flink app. But I'm not sure..


Regards.


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-13 Thread Sofia’s World
Many thanks Hu, worked like a charm

A few quick questions:
so in my reqs.txt I should put all Beam requirements PLUS my own?

and in setup.py, shall I just declare

"apache-beam[gcp]==2.54.0",  # Must match the version in `Dockerfile`.

thanks and kind regards
Marco






On Wed, Jun 12, 2024 at 1:48 PM XQ Hu  wrote:

> Any reason to use this?
>
> RUN pip install avro-python3 pyarrow==0.15.1 apache-beam[gcp]==2.30.0
>  pandas-datareader==0.9.0
>
> It is typically recommended to use the latest Beam and build the docker
> image using the requirements released for each Beam, for example,
> https://github.com/apache/beam/blob/release-2.56.0/sdks/python/container/py311/base_image_requirements.txt
>
> On Wed, Jun 12, 2024 at 1:31 AM Sofia’s World  wrote:
>
>> Sure, apologies, it crossed my mind it would have been useful to refert
>> to it
>>
>> so this is the docker file
>>
>>
>> https://github.com/mmistroni/GCP_Experiments/edit/master/dataflow/shareloader/Dockerfile_tester
>>
>> I was using a setup.py as well, but then i commented out the usage in the
>> dockerfile after checking some flex templates which said it is not needed
>>
>>
>> https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/shareloader/setup_dftester.py
>>
>> thanks in advance
>>  Marco
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 11, 2024 at 10:54 PM XQ Hu  wrote:
>>
>>> Can you share your Dockerfile?
>>>
>>> On Tue, Jun 11, 2024 at 4:43 PM Sofia’s World 
>>> wrote:
>>>
 thanks all,  it seemed to work but now i am getting a different
 problem, having issues in building pyarrow...

 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
 :36: DeprecationWarning: pkg_resources is deprecated as an API. 
 See https://setuptools.pypa.io/en/latest/pkg_resources.html
 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
 WARNING setuptools_scm.pyproject_reading toml section missing 
 'pyproject.toml does not contain a tool.setuptools_scm section'
 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
 Traceback (most recent call last):
 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
  File 
 "/tmp/pip-build-env-meihcxsp/overlay/lib/python3.11/site-packages/setuptools_scm/_integration/pyproject_reading.py",
  line 36, in read_pyproject
 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
section = defn.get("tool", {})[tool_name]
 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
  ^^^
 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
 KeyError: 'setuptools_scm'
 Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
 running bdist_wheel




 It is somehow getting messed up with a toml ?


 Could anyone advise?

 thanks

  Marco





 On Tue, Jun 11, 2024 at 1:00 AM XQ Hu via user 
 wrote:

>
> https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
> is a great example.
>
> On Mon, Jun 10, 2024 at 4:28 PM Valentyn Tymofieiev via user <
> user@beam.apache.org> wrote:
>
>> In this case the Python version will be defined by the Python version
>> installed in the docker image of your flex template. So, you'd have to
>> build your flex template from a base image with Python 3.11.
>>
>> On Mon, Jun 10, 2024 at 12:50 PM Sofia’s World 
>> wrote:
>>
>>> Hello
>>>  no i am running my pipelien on  GCP directly via a flex template,
>>> configured using a Docker file
>>> Any chances to do something in the Dockerfile to force the version
>>> at runtime?
>>> Thanks
>>>
>>> On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user <
>>> user@beam.apache.org> wrote:
>>>
 Hello,

 Are you running your pipeline from the python 3.11 environment?  If
 you are running from a python 3.11 environment and don't use a custom
 docker container image, DataflowRunner(Assuming Apache Beam on GCP 
 means
 Apache Beam on DataflowRunner), will use Python 3.11.

 Thanks,
 Anand

>>>


Re: How windowing is implemented on Flink runner

2024-06-12 Thread Ruben Vargas
I imagined it but wasn't sure!

Thanks for the clarification!

On Wed, Jun 12, 2024 at 1:42 PM Robert Bradshaw via user
 wrote:
>
> Beam implements Windowing itself (via state and timers) rather than
> deferring to Flink's implementation.
>
> On Wed, Jun 12, 2024 at 11:55 AM Ruben Vargas  wrote:
> >
> > Hello guys
> >
> > May be a silly question,
> >
> > But in the Flink runner, the window implementation uses the Flink
> > windowing? Does that mean the runner will have performance issues like
> > Flink itself? see this:
> > https://issues.apache.org/jira/browse/FLINK-7001
> >
> > I'm asking because I see the issue, it mentions different concepts
> > that Beam already handles at the API level. So my suspicion is that
> > the Beam model handles windowing a little differently from the pure
> > Flink app. But I'm not sure..
> >
> >
> > Regards.


Re: How windowing is implemented on Flink runner

2024-06-12 Thread Robert Bradshaw via user
Beam implements Windowing itself (via state and timers) rather than
deferring to Flink's implementation.

On Wed, Jun 12, 2024 at 11:55 AM Ruben Vargas  wrote:
>
> Hello guys
>
> May be a silly question,
>
> But in the Flink runner, the window implementation uses the Flink
> windowing? Does that mean the runner will have performance issues like
> Flink itself? see this:
> https://issues.apache.org/jira/browse/FLINK-7001
>
> I'm asking because I see the issue, it mentions different concepts
> that Beam already handles at the API level. So my suspicion is that
> the Beam model handles windowing a little differently from the pure
> Flink app. But I'm not sure..
>
>
> Regards.


Re: Paralalelism of a side input

2024-06-12 Thread Robert Bradshaw via user
On Wed, Jun 12, 2024 at 7:56 AM Ruben Vargas  wrote:
>
> The approach looks good. but one question
>
> My understanding is that this will schedule for example 8 operators across 
> the workers, but only one of them will be processing, the others remain idle? 
> Are those consuming resources in some way? I'm assuming may be is not 
> significant.

That is correct, but the resources consumed by an idle operator should
be negligible.

> Thanks.
>
> El El vie, 7 de jun de 2024 a la(s) 3:56 p.m., Robert Bradshaw via user 
>  escribió:
>>
>> You can always limit the parallelism by assigning a single key to
>> every element and then doing a grouping or reshuffle[1] on that key
>> before processing the elements. Even if the operator parallelism for
>> that step is technically, say, eight, your effective parallelism will
>> be exactly one.
>>
>> [1] 
>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html
>>
>> On Fri, Jun 7, 2024 at 2:13 PM Ruben Vargas  wrote:
>> >
>> > Hello guys
>> >
>> > One question, I have a side input which fetches an endpoint each 30
>> > min, I pretty much copied the example here:
>> > https://beam.apache.org/documentation/patterns/side-inputs/ but added
>> > some logic to fetch the endpoint and parse the payload.
>> >
>> > My question is: it is possible to control the parallelism of this
>> > single ParDo that does the fetch/transform? I don't think I need a lot
>> > of parallelism for that one. I'm currently using flink runner and I
>> > see the parallelism is 8 (which is the general parallelism for my
>> > flink cluster).
>> >
>> > Is it possible to set it to 1 for example?
>> >
>> >
>> > Regards.


Re: Paralalelism of a side input

2024-06-12 Thread Ruben Vargas
The approach looks good. but one question

My understanding is that this will schedule for example 8 operators across
the workers, but only one of them will be processing, the others
remain idle? Are those consuming resources in some way? I'm assuming may be
is not significant.

Thanks.

El El vie, 7 de jun de 2024 a la(s) 3:56 p.m., Robert Bradshaw via user <
user@beam.apache.org> escribió:

> You can always limit the parallelism by assigning a single key to
> every element and then doing a grouping or reshuffle[1] on that key
> before processing the elements. Even if the operator parallelism for
> that step is technically, say, eight, your effective parallelism will
> be exactly one.
>
> [1]
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html
>
> On Fri, Jun 7, 2024 at 2:13 PM Ruben Vargas 
> wrote:
> >
> > Hello guys
> >
> > One question, I have a side input which fetches an endpoint each 30
> > min, I pretty much copied the example here:
> > https://beam.apache.org/documentation/patterns/side-inputs/ but added
> > some logic to fetch the endpoint and parse the payload.
> >
> > My question is: it is possible to control the parallelism of this
> > single ParDo that does the fetch/transform? I don't think I need a lot
> > of parallelism for that one. I'm currently using flink runner and I
> > see the parallelism is 8 (which is the general parallelism for my
> > flink cluster).
> >
> > Is it possible to set it to 1 for example?
> >
> >
> > Regards.
>


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-12 Thread XQ Hu via user
Any reason to use this?

RUN pip install avro-python3 pyarrow==0.15.1 apache-beam[gcp]==2.30.0
 pandas-datareader==0.9.0

It is typically recommended to use the latest Beam release and build the Docker
image using the requirements published for each Beam version, for example,
https://github.com/apache/beam/blob/release-2.56.0/sdks/python/container/py311/base_image_requirements.txt
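A rough sketch of what that can look like in a flex-template Dockerfile (the base image tag, file names, and paths are assumptions on my side, not taken from your repo):

```
FROM gcr.io/dataflow-templates-base/python311-template-launcher-base

# requirements.txt derived from the released
# sdks/python/container/py311/base_image_requirements.txt plus your own pins.
COPY requirements.txt setup.py main.py /template/
RUN pip install --no-cache-dir -r /template/requirements.txt

ENV FLEX_TEMPLATE_PYTHON_PY_FILE=/template/main.py
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=/template/requirements.txt
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE=/template/setup.py
```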

On Wed, Jun 12, 2024 at 1:31 AM Sofia’s World  wrote:

> Sure, apologies, it crossed my mind it would have been useful to refert to
> it
>
> so this is the docker file
>
>
> https://github.com/mmistroni/GCP_Experiments/edit/master/dataflow/shareloader/Dockerfile_tester
>
> I was using a setup.py as well, but then i commented out the usage in the
> dockerfile after checking some flex templates which said it is not needed
>
>
> https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/shareloader/setup_dftester.py
>
> thanks in advance
>  Marco
>
>
>
>
>
>
>
> On Tue, Jun 11, 2024 at 10:54 PM XQ Hu  wrote:
>
>> Can you share your Dockerfile?
>>
>> On Tue, Jun 11, 2024 at 4:43 PM Sofia’s World 
>> wrote:
>>
>>> thanks all,  it seemed to work but now i am getting a different problem,
>>> having issues in building pyarrow...
>>>
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>>> :36: DeprecationWarning: pkg_resources is deprecated as an API. See 
>>> https://setuptools.pypa.io/en/latest/pkg_resources.html
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>>> WARNING setuptools_scm.pyproject_reading toml section missing 
>>> 'pyproject.toml does not contain a tool.setuptools_scm section'
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>>> Traceback (most recent call last):
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image": 
>>> File 
>>> "/tmp/pip-build-env-meihcxsp/overlay/lib/python3.11/site-packages/setuptools_scm/_integration/pyproject_reading.py",
>>>  line 36, in read_pyproject
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image": 
>>>   section = defn.get("tool", {})[tool_name]
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image": 
>>> ^^^
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>>> KeyError: 'setuptools_scm'
>>> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>>> running bdist_wheel
>>>
>>>
>>>
>>>
>>> It is somehow getting messed up with a toml ?
>>>
>>>
>>> Could anyone advise?
>>>
>>> thanks
>>>
>>>  Marco
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 11, 2024 at 1:00 AM XQ Hu via user 
>>> wrote:
>>>

 https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
 is a great example.

 On Mon, Jun 10, 2024 at 4:28 PM Valentyn Tymofieiev via user <
 user@beam.apache.org> wrote:

> In this case the Python version will be defined by the Python version
> installed in the docker image of your flex template. So, you'd have to
> build your flex template from a base image with Python 3.11.
>
> On Mon, Jun 10, 2024 at 12:50 PM Sofia’s World 
> wrote:
>
>> Hello
>>  no i am running my pipelien on  GCP directly via a flex template,
>> configured using a Docker file
>> Any chances to do something in the Dockerfile to force the version at
>> runtime?
>> Thanks
>>
>> On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user <
>> user@beam.apache.org> wrote:
>>
>>> Hello,
>>>
>>> Are you running your pipeline from the python 3.11 environment?  If
>>> you are running from a python 3.11 environment and don't use a custom
>>> docker container image, DataflowRunner(Assuming Apache Beam on GCP means
>>> Apache Beam on DataflowRunner), will use Python 3.11.
>>>
>>> Thanks,
>>> Anand
>>>
>>


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-11 Thread XQ Hu via user
Can you share your Dockerfile?

On Tue, Jun 11, 2024 at 4:43 PM Sofia’s World  wrote:

> thanks all,  it seemed to work but now i am getting a different problem,
> having issues in building pyarrow...
>
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> :36: DeprecationWarning: pkg_resources is deprecated as an API. See 
> https://setuptools.pypa.io/en/latest/pkg_resources.html
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> WARNING setuptools_scm.pyproject_reading toml section missing 'pyproject.toml 
> does not contain a tool.setuptools_scm section'
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> Traceback (most recent call last):
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image": 
> File 
> "/tmp/pip-build-env-meihcxsp/overlay/lib/python3.11/site-packages/setuptools_scm/_integration/pyproject_reading.py",
>  line 36, in read_pyproject
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> section = defn.get("tool", {})[tool_name]
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
>   ^^^
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> KeyError: 'setuptools_scm'
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image":   
> running bdist_wheel
>
>
>
>
> It is somehow getting messed up with a toml ?
>
>
> Could anyone advise?
>
> thanks
>
>  Marco
>
>
>
>
>
> On Tue, Jun 11, 2024 at 1:00 AM XQ Hu via user 
> wrote:
>
>>
>> https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
>> is a great example.
>>
>> On Mon, Jun 10, 2024 at 4:28 PM Valentyn Tymofieiev via user <
>> user@beam.apache.org> wrote:
>>
>>> In this case the Python version will be defined by the Python version
>>> installed in the docker image of your flex template. So, you'd have to
>>> build your flex template from a base image with Python 3.11.
>>>
>>> On Mon, Jun 10, 2024 at 12:50 PM Sofia’s World 
>>> wrote:
>>>
 Hello
  no i am running my pipelien on  GCP directly via a flex template,
 configured using a Docker file
 Any chances to do something in the Dockerfile to force the version at
 runtime?
 Thanks

 On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user <
 user@beam.apache.org> wrote:

> Hello,
>
> Are you running your pipeline from the python 3.11 environment?  If
> you are running from a python 3.11 environment and don't use a custom
> docker container image, DataflowRunner(Assuming Apache Beam on GCP means
> Apache Beam on DataflowRunner), will use Python 3.11.
>
> Thanks,
> Anand
>



Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-11 Thread Sofia’s World
thanks all,  it seemed to work but now I am getting a different problem,
having issues in building pyarrow...

Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
   :36: DeprecationWarning: pkg_resources is deprecated as an
API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
   WARNING setuptools_scm.pyproject_reading toml section missing
'pyproject.toml does not contain a tool.setuptools_scm section'
Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
   Traceback (most recent call last):
Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
 File 
"/tmp/pip-build-env-meihcxsp/overlay/lib/python3.11/site-packages/setuptools_scm/_integration/pyproject_reading.py",
line 36, in read_pyproject
Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
   section = defn.get("tool", {})[tool_name]
Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
 ^^^
Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
   KeyError: 'setuptools_scm'
Step #0 - "build-shareloader-template": Step #4 - "dftester-image":
   running bdist_wheel




It is somehow getting messed up with a toml ?


Could anyone advise?

thanks

 Marco





On Tue, Jun 11, 2024 at 1:00 AM XQ Hu via user  wrote:

>
> https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
> is a great example.
>
> On Mon, Jun 10, 2024 at 4:28 PM Valentyn Tymofieiev via user <
> user@beam.apache.org> wrote:
>
>> In this case the Python version will be defined by the Python version
>> installed in the docker image of your flex template. So, you'd have to
>> build your flex template from a base image with Python 3.11.
>>
>> On Mon, Jun 10, 2024 at 12:50 PM Sofia’s World 
>> wrote:
>>
>>> Hello
>>>  no i am running my pipelien on  GCP directly via a flex template,
>>> configured using a Docker file
>>> Any chances to do something in the Dockerfile to force the version at
>>> runtime?
>>> Thanks
>>>
>>> On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user <
>>> user@beam.apache.org> wrote:
>>>
 Hello,

 Are you running your pipeline from the python 3.11 environment?  If you
 are running from a python 3.11 environment and don't use a custom docker
 container image, DataflowRunner(Assuming Apache Beam on GCP means Apache
 Beam on DataflowRunner), will use Python 3.11.

 Thanks,
 Anand

>>>


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-10 Thread XQ Hu via user
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies
is a great example.

On Mon, Jun 10, 2024 at 4:28 PM Valentyn Tymofieiev via user <
user@beam.apache.org> wrote:

> In this case the Python version will be defined by the Python version
> installed in the docker image of your flex template. So, you'd have to
> build your flex template from a base image with Python 3.11.
>
> On Mon, Jun 10, 2024 at 12:50 PM Sofia’s World 
> wrote:
>
>> Hello
>>  no i am running my pipelien on  GCP directly via a flex template,
>> configured using a Docker file
>> Any chances to do something in the Dockerfile to force the version at
>> runtime?
>> Thanks
>>
>> On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user <
>> user@beam.apache.org> wrote:
>>
>>> Hello,
>>>
>>> Are you running your pipeline from the python 3.11 environment?  If you
>>> are running from a python 3.11 environment and don't use a custom docker
>>> container image, DataflowRunner(Assuming Apache Beam on GCP means Apache
>>> Beam on DataflowRunner), will use Python 3.11.
>>>
>>> Thanks,
>>> Anand
>>>
>>


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-10 Thread Valentyn Tymofieiev via user
In this case the Python version will be defined by the Python version
installed in the docker image of your flex template. So, you'd have to
build your flex template from a base image with Python 3.11.

On Mon, Jun 10, 2024 at 12:50 PM Sofia’s World  wrote:

> Hello
>  no i am running my pipelien on  GCP directly via a flex template,
> configured using a Docker file
> Any chances to do something in the Dockerfile to force the version at
> runtime?
> Thanks
>
> On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user <
> user@beam.apache.org> wrote:
>
>> Hello,
>>
>> Are you running your pipeline from the python 3.11 environment?  If you
>> are running from a python 3.11 environment and don't use a custom docker
>> container image, DataflowRunner(Assuming Apache Beam on GCP means Apache
>> Beam on DataflowRunner), will use Python 3.11.
>>
>> Thanks,
>> Anand
>>
>


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-10 Thread Sofia’s World
Hello,
no, I am running my pipeline on GCP directly via a flex template,
configured using a Dockerfile.
Any chance to do something in the Dockerfile to force the version at
runtime?
Thanks

On Mon, Jun 10, 2024 at 7:24 PM Anand Inguva via user 
wrote:

> Hello,
>
> Are you running your pipeline from the python 3.11 environment?  If you
> are running from a python 3.11 environment and don't use a custom docker
> container image, DataflowRunner(Assuming Apache Beam on GCP means Apache
> Beam on DataflowRunner), will use Python 3.11.
>
> Thanks,
> Anand
>


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-10 Thread Anand Inguva via user
Hello,

Are you running your pipeline from a Python 3.11 environment? If you are
running from a Python 3.11 environment and don't use a custom Docker
container image, DataflowRunner (assuming Apache Beam on GCP means Apache
Beam on DataflowRunner) will use Python 3.11.

Thanks,
Anand


Re: Apache Bean on GCP / Forcing to use py 3.11

2024-06-10 Thread Ahmet Altay via user
If you can use Python 3.11 locally, you will get Python 3.11 in your cloud
environment as well. Is that not happening?

When you run Apache Beam on GCP, the Python version you are using in your
local virtual environment will be used in the cloud environment as well. I
believe this is true for non-GCP environments too.


On Mon, Jun 10, 2024 at 11:08 AM Sofia’s World  wrote:

> Hello
>  sorry for the partially off-topic question
> I am running a pipeline in which one of the dependencies needs to run on py
> 3.11
> But I don't see any options that allow me to force the Python version to be
> used
>
> Could anyone help?
> Kind regards
> Marco
>


Re: Beam + VertexAI

2024-06-09 Thread XQ Hu via user
If you have a Vertex AI model, try
https://cloud.google.com/dataflow/docs/notebooks/run_inference_vertex_ai
If you want to use the Vertex AI model to do text embedding, try
https://cloud.google.com/dataflow/docs/notebooks/vertex_ai_text_embeddings
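For the first case, the heart of the pipeline is roughly the following (the endpoint ID, project, and location are placeholders; see the notebook for the full setup):

```python
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON

# Hypothetical endpoint hosting a text model deployed on Vertex AI.
model_handler = VertexAIModelHandlerJSON(
    endpoint_id="1234567890",
    project="my-gcp-project",
    location="us-central1",
)

with beam.Pipeline() as p:
    _ = (
        p
        | "ReadNews" >> beam.Create(["Summarize: some news article text ..."])
        | "Summarize" >> RunInference(model_handler)
        | "Print" >> beam.Map(print)
    )
```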

On Sun, Jun 9, 2024 at 4:40 AM Sofia’s World  wrote:

> HI all
>  i am looking for samples of integrating VertexAI into apache beam..
> As sample, i want to create a pipeline that retrieves some news
> information and will invoke
> VertexAI to summarize the main point of every news...
>
> Could you anyone give me some pointers?
> Kind regards
>  marco
>


Re: Paralalelism of a side input

2024-06-07 Thread Robert Bradshaw via user
You can always limit the parallelism by assigning a single key to
every element and then doing a grouping or reshuffle[1] on that key
before processing the elements. Even if the operator parallelism for
that step is technically, say, eight, your effective parallelism will
be exactly one.

[1] 
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html
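In the Python SDK that pattern can be as small as the sketch below (p is an existing Pipeline, and fetch_and_parse stands in for your side-input fetch/parse logic):

```python
import apache_beam as beam

def fetch_and_parse(_values):
    # Stand-in for the HTTP fetch + payload parsing done in the side input.
    return [{"config": "value"}]

side_input = (
    p
    | "Tick" >> beam.Create([None])
    | "SingleKey" >> beam.Map(lambda x: (0, x))  # every element gets the same key
    | "ForceSingleWorker" >> beam.GroupByKey()   # all values for key 0 are processed together
    | "Fetch" >> beam.FlatMap(lambda kv: fetch_and_parse(kv[1]))
)
```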

On Fri, Jun 7, 2024 at 2:13 PM Ruben Vargas  wrote:
>
> Hello guys
>
> One question, I have a side input which fetches an endpoint each 30
> min, I pretty much copied the example here:
> https://beam.apache.org/documentation/patterns/side-inputs/ but added
> some logic to fetch the endpoint and parse the payload.
>
> My question is: it is possible to control the parallelism of this
> single ParDo that does the fetch/transform? I don't think I need a lot
> of parallelism for that one. I'm currently using flink runner and I
> see the parallelism is 8 (which is the general parallelism for my
> flink cluster).
>
> Is it possible to set it to 1 for example?
>
>
> Regards.


Re: Question: Pipelines Stuck with Java 21 and BigQuery Storage Write API

2024-06-07 Thread Yi Hu via user
Hi,

Which runner are you using? If you are running on the Dataflow runner, then
refer to [1] and add
"--jdkAddOpenModules=java.base/java.lang=ALL-UNNAMED" to the pipeline options.
If using the direct runner, then add
"--add-opens=java.base/java.lang=ALL-UNNAMED" to the JVM invocation command
line.
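For example (the main class and the Maven invocation are placeholders; with `mvn exec:java` the flag has to reach the Maven JVM itself, e.g. via MAVEN_OPTS):

```
# Dataflow runner: pass it as a pipeline option
--jdkAddOpenModules=java.base/java.lang=ALL-UNNAMED

# Direct runner launched through Maven:
MAVEN_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED" \
  mvn compile exec:java -Dexec.mainClass=com.example.App
```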

The same enforcement was introduced in both Java 17 and 21, and it is
strange that Java 17 worked without the option but Java 21 didn't. Are you
testing on the same Beam version and with the same configuration? Also, more
recent Beam versions eliminated most usage of
"ClassLoadingStrategy.Default.INJECTION"
that caused this pipeline option to be required, e.g. [2]. Try the latest
Beam version, 2.56.0, and this option may not be needed.

[1]
https://beam.apache.org/releases/javadoc/current/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html#getJdkAddOpenModules--

[2] https://github.com/apache/beam/pull/30367



On Mon, Jun 3, 2024 at 7:14 PM XQ Hu  wrote:

> Probably related to the strict encapsulation that is enforced with Java
> 21.
> Use `--add-opens=java.base/java.lang=ALL-UNNAMED` as the JVM flag could be
> a temporary workaround.
>
> On Mon, Jun 3, 2024 at 3:04 AM 田中万葉  wrote:
>
>> Hi all,
>>
>> I encountered an UnsupportedOperationException when using Java 21 and the
>> BigQuery Storage Write API in a Beam pipeline by using
>> ".withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));"
>>
>> Having read issue #28120[1] and understanding that Beam version 2.52.0 or
>> later supports Java 21 as a runtime, I wonder why such an error happens.
>>
>> I found there are two workarounds, but the Storage Write API is a more
>> preferable way to insert data into BigQuery, so I'd like to find a
>> solution.
>>
>> 1. One workaround is to switch from Java 21 to Java 17(openjdk version
>> "17.0.10" 2024-01-16). By changing the  and
>>  in the pom.xml file (i.e., without modifying
>> App.java itself), the pipeline successfully writes data to my destination
>> table on BigQuery. It seems Java 17 and BigQuery Storage Write API works
>> fine.
>> 2. The other workaround is to change insert method. I tried the BigQuery
>> legacy streaming API(
>> https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery )
>> instead of the Storage Write API. Even though I still used Java 21, when I
>> changed my code to
>> .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));, I did not
>> encounter the error.
>>
>> So I faced the error only when using Java 21 and BigQuery Storage Write
>> API.
>>
>> I uploaded the code below to reproduce. Could you please inform me how to
>> handle this issue?
>> https://github.com/cloud-ace/min-reproduce
>>
>> My Environment
>> - OS
>>   - Ubuntu 22.04
>>   - Mac OS Sonoma(14.3.1)
>> - beam 2.53.0, 2.54.0
>> - openjdk version "21.0.2" 2024-01-16
>> - maven 3.9.6
>> - DirectRunner
>>
>> Thanks,
>>
>> Kazuha
>>
>> [1]: https://github.com/apache/beam/issues/28120
>>
>> Here is the detailed error message.
>>
>> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
>> java.lang.UnsupportedOperationException: Cannot define class using
>> reflection: Unable to make protected java.lang.Package
>> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
>> java.base does not "opens java.lang" to unnamed module @116d5dff
>>
>> Caused by: java.lang.UnsupportedOperationException: Cannot define class
>> using reflection: Unable to make protected java.lang.Package
>> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
>> java.base does not "opens java.lang" to unnamed module @116d5dff
>> at
>> net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass
>> (ClassInjector.java:472)
>> at
>> net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw
>> (ClassInjector.java:284)
>> at net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject
>> (ClassInjector.java:118)
>> at
>> net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load
>> (ClassLoadingStrategy.java:241)
>> at net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load
>> (ClassLoadingStrategy.java:148)
>> at net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize
>> (TypeResolutionStrategy.java:101)
>> at net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load
>> (DynamicType.java:6317)
>> at
>> org.apache.beam.sdk.schemas.utils.AutoValueUtils.createBuilderCreator
>> (AutoValueUtils.java:247)
>> at org.apache.beam.sdk.schemas.utils.AutoValueUtils.getBuilderCreator
>> (AutoValueUtils.java:225)
>> at org.apache.beam.sdk.schemas.AutoValueSchema.schemaTypeCreator
>> (AutoValueSchema.java:122)
>> at org.apache.beam.sdk.schemas.CachingFactory.create
>> (CachingFactory.java:56)
>> at org.apache.beam.sdk.schemas.FromRowUsingCreator.apply
>> (FromRowUsingCreator.java:94)
>> at org.apache.beam.sdk.schemas.FromRowUsingCreator.apply
>> (FromRowUsingCreator.java:45)
>> at org.

Re: Question: Pipelines Stuck with Java 21 and BigQuery Storage Write API

2024-06-06 Thread 田中万葉
Hi, XQ
Thank you for your reply.

I tried to implement your suggestion by running `mvn compile exec:java
-Dexec.args="--add-opens=java.base/java.lang=ALL-UNNAMED"` or by editing a
plugin in the pom.xml as shown below.
However, I was unable to resolve this issue. As I am very new to Beam and
the OSS community, I am unsure of what the next step should be. Should I
create an issue on Github, or could you point out where I might be missing
something?

I appreciate your help.

Best regards,

Kazuha

```pom.xml
  
org.codehaus.mojo
exec-maven-plugin
3.0.0

  

  java

  


  com.example.MainClass
  

-Dexec.args="--add-opens=java.base/java.lang=ALL-UNNAMED"
  

  
```

On Tue, Jun 4, 2024 at 8:14 AM XQ Hu via user  wrote:

> Probably related to the strict encapsulation that is enforced with Java
> 21.
> Use `--add-opens=java.base/java.lang=ALL-UNNAMED` as the JVM flag could be
> a temporary workaround.
>
> On Mon, Jun 3, 2024 at 3:04 AM 田中万葉  wrote:
>
>> Hi all,
>>
>> I encountered an UnsupportedOperationException when using Java 21 and the
>> BigQuery Storage Write API in a Beam pipeline by using
>> ".withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));"
>>
>> Having read issue #28120[1] and understanding that Beam version 2.52.0 or
>> later supports Java 21 as a runtime, I wonder why such an error happens.
>>
>> I found there are two workarounds, but the Storage Write API is a more
>> preferable way to insert data into BigQuery, so I'd like to find a
>> solution.
>>
>> 1. One workaround is to switch from Java 21 to Java 17(openjdk version
>> "17.0.10" 2024-01-16). By changing the  and
>>  in the pom.xml file (i.e., without modifying
>> App.java itself), the pipeline successfully writes data to my destination
>> table on BigQuery. It seems Java 17 and BigQuery Storage Write API works
>> fine.
>> 2. The other workaround is to change insert method. I tried the BigQuery
>> legacy streaming API(
>> https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery )
>> instead of the Storage Write API. Even though I still used Java 21, when I
>> changed my code to
>> .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));, I did not
>> encounter the error.
>>
>> So I faced the error only when using Java 21 and BigQuery Storage Write
>> API.
>>
>> I uploaded the code below to reproduce. Could you please inform me how to
>> handle this issue?
>> https://github.com/cloud-ace/min-reproduce
>>
>> My Environment
>> - OS
>>   - Ubuntu 22.04
>>   - Mac OS Sonoma(14.3.1)
>> - beam 2.53.0, 2.54.0
>> - openjdk version "21.0.2" 2024-01-16
>> - maven 3.9.6
>> - DirectRunner
>>
>> Thanks,
>>
>> Kazuha
>>
>> [1]: https://github.com/apache/beam/issues/28120
>>
>> Here is the detailed error message.
>>
>> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
>> java.lang.UnsupportedOperationException: Cannot define class using
>> reflection: Unable to make protected java.lang.Package
>> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
>> java.base does not "opens java.lang" to unnamed module @116d5dff
>>
>> Caused by: java.lang.UnsupportedOperationException: Cannot define class
>> using reflection: Unable to make protected java.lang.Package
>> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
>> java.base does not "opens java.lang" to unnamed module @116d5dff
>> at
>> net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass
>> (ClassInjector.java:472)
>> at
>> net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw
>> (ClassInjector.java:284)
>> at net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject
>> (ClassInjector.java:118)
>> at
>> net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load
>> (ClassLoadingStrategy.java:241)
>> at net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load
>> (ClassLoadingStrategy.java:148)
>> at net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize
>> (TypeResolutionStrategy.java:101)
>> at net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load
>> (DynamicType.java:6317)
>> at
>> org.apache.beam.sdk.schemas.utils.AutoValueUtils.createBuilderCreator
>> (AutoValueUtils.java:247)
>> at org.apache.beam.sdk.schemas.utils.AutoValueUtils.getBuilderCreator
>> (AutoValueUtils.java:225)
>> at org.apache.beam.sdk.schemas.AutoValueSchema.schemaTypeCreator
>> (AutoValueSchema.java:122)
>> at org.apache.beam.sdk.schemas.CachingFactory.create
>> (CachingFactory.java:56)
>> at org.apache.beam.sdk.schemas.FromRowUsingCreator.apply
>> (FromRowUsingCreator.java:94)
>> at org.apache.beam.sdk.schemas.FromRowUsingCreator.apply
>> (FromRowUsingCreator.java:45)
>> at org.apache.beam.sdk.schemas.SchemaCoder.decode
>> (SchemaCoder.java:126)
>> at org.apache.be

Re: Question: Pipelines Stuck with Java 21 and BigQuery Storage Write API

2024-06-03 Thread XQ Hu via user
Probably related to the strict encapsulation that is enforced with Java 21.
Using `--add-opens=java.base/java.lang=ALL-UNNAMED` as a JVM flag could be
a temporary workaround.

On Mon, Jun 3, 2024 at 3:04 AM 田中万葉  wrote:

> Hi all,
>
> I encountered an UnsupportedOperationException when using Java 21 and the
> BigQuery Storage Write API in a Beam pipeline by using
> ".withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));"
>
> Having read issue #28120[1] and understanding that Beam version 2.52.0 or
> later supports Java 21 as a runtime, I wonder why such an error happens.
>
> I found there are two workarounds, but the Storage Write API is a more
> preferable way to insert data into BigQuery, so I'd like to find a
> solution.
>
> 1. One workaround is to switch from Java 21 to Java 17(openjdk version
> "17.0.10" 2024-01-16). By changing the  and
>  in the pom.xml file (i.e., without modifying
> App.java itself), the pipeline successfully writes data to my destination
> table on BigQuery. It seems Java 17 and BigQuery Storage Write API works
> fine.
> 2. The other workaround is to change insert method. I tried the BigQuery
> legacy streaming API(
> https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery )
> instead of the Storage Write API. Even though I still used Java 21, when I
> changed my code to
> .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));, I did not
> encounter the error.
>
> So I faced the error only when using Java 21 and BigQuery Storage Write
> API.
>
> I uploaded the code below to reproduce. Could you please inform me how to
> handle this issue?
> https://github.com/cloud-ace/min-reproduce
>
> My Environment
> - OS
>   - Ubuntu 22.04
>   - Mac OS Sonoma(14.3.1)
> - beam 2.53.0, 2.54.0
> - openjdk version "21.0.2" 2024-01-16
> - maven 3.9.6
> - DirectRunner
>
> Thanks,
>
> Kazuha
>
> [1]: https://github.com/apache/beam/issues/28120
>
> Here is the detailed error message.
>
> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
> java.lang.UnsupportedOperationException: Cannot define class using
> reflection: Unable to make protected java.lang.Package
> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
> java.base does not "opens java.lang" to unnamed module @116d5dff
>
> Caused by: java.lang.UnsupportedOperationException: Cannot define class
> using reflection: Unable to make protected java.lang.Package
> java.lang.ClassLoader.getPackage(java.lang.String) accessible: module
> java.base does not "opens java.lang" to unnamed module @116d5dff
> at
> net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass
> (ClassInjector.java:472)
> at
> net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw
> (ClassInjector.java:284)
> at net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject
> (ClassInjector.java:118)
> at
> net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load
> (ClassLoadingStrategy.java:241)
> at net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load
> (ClassLoadingStrategy.java:148)
> at net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize
> (TypeResolutionStrategy.java:101)
> at net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load
> (DynamicType.java:6317)
> at
> org.apache.beam.sdk.schemas.utils.AutoValueUtils.createBuilderCreator
> (AutoValueUtils.java:247)
> at org.apache.beam.sdk.schemas.utils.AutoValueUtils.getBuilderCreator
> (AutoValueUtils.java:225)
> at org.apache.beam.sdk.schemas.AutoValueSchema.schemaTypeCreator
> (AutoValueSchema.java:122)
> at org.apache.beam.sdk.schemas.CachingFactory.create
> (CachingFactory.java:56)
> at org.apache.beam.sdk.schemas.FromRowUsingCreator.apply
> (FromRowUsingCreator.java:94)
> at org.apache.beam.sdk.schemas.FromRowUsingCreator.apply
> (FromRowUsingCreator.java:45)
> at org.apache.beam.sdk.schemas.SchemaCoder.decode
> (SchemaCoder.java:126)
> at org.apache.beam.sdk.coders.Coder.decode (Coder.java:159)
> at org.apache.beam.sdk.coders.KvCoder.decode (KvCoder.java:84)
> at org.apache.beam.sdk.coders.KvCoder.decode (KvCoder.java:37)
> at org.apache.beam.sdk.util.CoderUtils.decodeFromSafeStream
> (CoderUtils.java:142)
> at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray
> (CoderUtils.java:102)
> at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray
> (CoderUtils.java:96)
> at org.apache.beam.sdk.util.CoderUtils.clone (CoderUtils.java:168)
> at
> org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.
> (MutationDetectors.java:118)
> at org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder
> (MutationDetectors.java:49)
> at
> org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add
> (ImmutabilityCheckingBundleFactory.java:115)
> at
> org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output

Re: Query about autinference of numPartitions for `JdbcIO#readWithPartitions`

2024-05-31 Thread XQ Hu via user
You should be able to configure the number of partitions like this:

https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Java/src/main/java/jdbc/ReadPartitionsJdbc.java#L132

The code to auto-infer the number of partitions seems to be unreachable (I
haven't checked this carefully). More details are here:
https://issues.apache.org/jira/browse/BEAM-12456

On Fri, May 31, 2024 at 7:40 AM Vardhan Thigle via user <
user@beam.apache.org> wrote:

> Hi Beam Experts, I have a small query about `JdbcIO#readWithPartitions`.
>
> Context
> JdbcIO#readWithPartitions seems to always default to 200 partitions
> (DEFAULT_NUM_PARTITIONS). This is set by default when the object is
> constructed. There seems to be no way to override this with a null value.
> Hence it seems that the code that checks the null value and tries to
> auto-infer the number of partitions never runs. I am trying to use this
> for reading a tall table of unknown size, and the pipeline always defaults
> to 200 if the value is not set. The default of 200 seems to fall short as
> the worker goes out of memory in the reshuffle stage. Running with a higher
> number of partitions like 4K helps for my test setup. Since the size is not
> known at the time of implementing the pipeline, the auto-inference might
> help set maxPartitions to a reasonable value as per the heuristic decided
> by Beam code.
>
> Request for help
>
> Could you please clarify a few doubts around this?
>
>    1. Is this behavior intentional?
>    2. Could you please explain the rationale behind the heuristic in L1398
>    and DEFAULT_NUM_PARTITIONS=200?
>
> I have also raised this as issues/31467 in case it needs any changes in
> the implementation.
>
>
> Regards and Thanks,
> Vardhan Thigle,
> +919535346204
>


Re: Error handling for GCP Pub/Sub on Dataflow using Python

2024-05-25 Thread XQ Hu via user
I do not suggest you handle this in beam.io.WriteToPubSub. You could change
your pipeline to add one transform to check the message size. If it is
beyond 10 MB, you could use another sink or process the message to reduce
the size.
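A sketch of that idea (the PCollection name, topic, and GCS path are placeholders; beam.Partition splits the stream so oversized payloads can be routed to a different sink):

```python
import apache_beam as beam

MAX_PUBSUB_BYTES = 10 * 1024 * 1024  # Pub/Sub's per-message limit

def by_size(msg, num_partitions):
    # Partition 0: small enough for Pub/Sub; partition 1: too large.
    return 0 if len(msg) <= MAX_PUBSUB_BYTES else 1

small, large = (
    messages  # PCollection of already-serialized bytes
    | "SplitBySize" >> beam.Partition(by_size, 2)
)

_ = small | "PublishSmall" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/my-topic")
# In a streaming job the file write needs a windowing/triggering strategy
# (e.g. beam.io.fileio.WriteToFiles); WriteToText is shown only for brevity.
_ = large | "WriteLargeToGCS" >> beam.io.WriteToText("gs://my-bucket/oversized/msg")
```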

On Fri, May 24, 2024 at 3:46 AM Nimrod Shory  wrote:

> Hello group,
> I am pretty new to Dataflow and Beam.
> I have deployed a Dataflow streaming job using Beam with Python.
> The final step of my pipeline is publishing a message to Pub/Sub.
> In certain cases the message can become too big for Pub/Sub (larger than
> the allowed 10MB) and in that case of failure, it just retries to publish
> indefinitely, causing the Job to eventually stop processing new data.
>
> My question is, is there a way to handle failures in beam.io.WriteToPubSub
> or should I implement a similar method myself?
>
> Ideally, I would like to write the too large messages to a file on cloud
> storage.
>
> Any ideas will be appreciated.
>
> Thanks in advance for your help!
>
>


Re: Question: Java Apache Beam, mock external Clients initialized in Setup

2024-05-25 Thread XQ Hu via user
I am not sure which part you want to test. If the processData part should
be tested, you could refactor the code so it does not use any Beam-specific
code and test the data-processing logic on its own.

From your example, it seems that you are calling some APIs; we recently
added a new Web API IO:
https://beam.apache.org/documentation/io/built-in/webapis/,
which provides a way to test.

On Wed, May 22, 2024 at 5:06 PM Ritwik Dutta via dev 
wrote:

> any response yet? No one has answers? I left a stackoverflow bounty on the
> question
>
> Using external methods is pretty important
>
> On Sunday, May 12, 2024 at 11:52:25 AM PDT, Ritwik Dutta <
> rdutt...@yahoo.com> wrote:
>
>
> Hi,
> I wrote the following question here.
> It would be really helpful also, if you can also update your documentation
> on Using Test Fakes in different Situations. It was very light
> documentation. Please provide more explanation and examples.
> https://beam.apache.org/documentation/io/testing/#:~:text=non%2DBeam%20client.-,Use%20fakes,-Instead%20of%20using
>
>
> *Question: *Java Apache Beam, mock external Clients initialized in @Setup
> method of DoFn with Constructors variables
>
> https://stackoverflow.com/questions/78468953/java-apache-beam-mock-external-clients-initialized-in-setup-method-of-dofn-wit
>
> Thanks,
>
> -Ritwik Dutta
>  734-262-4285
>


Re: KafkaIO/FixedWindow changes 2.56?

2024-05-21 Thread Yarden BenMoshe
I agree it looks very similar, will continue to look at it with the help of
this issue.
Thanks, Jan!


Does the description in [1] match your case?
>
> [1] https://github.com/apache/beam/issues/31085#issuecomment-2115304242
> On 5/19/24 10:07, Yarden BenMoshe wrote:
>
> I am not running my pipeline from command-line, so used instead:
> options.setExperiments(Collections.singletonList("use_deprecated_read"));
>
> with ExperimentalOptions added to my options interface, however I dont
> think there's any effect to using it. in terms of the watermark, i received
> again:
> WatermarkHold.addHolds: element hold at 2024-05-19T07:52:59.999Z is on
> time for key:aaa-bbb-ccc;
> window:[2024-05-19T07:52:00.000Z..2024-05-19T07:53:00.000Z);
> inputWatermark:-290308-12-21T19:59:05.225Z;
> outputWatermark:-290308-12-21T19:59:05.225Z
>
>
>
> On Thu, May 16, 2024 at 5:06 PM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Does using --experiments=use_deprecated_read have any effect?
>> On 5/16/24 14:30, Yarden BenMoshe wrote:
>>
>> Hi Jan, my PipelineOptions is as follows:
>> options.setStreaming(true);
>> options.setAttachedMode(false);
>> options.setRunner(FlinkRunner.class);
>>
>> I've also tried adding:
>> options.setAutoWatermarkInterval(100L);
>> as seen in some github issue, without any success so far.
>>
>> other than that, i am working with parallelism:3 and number of task
>> slots: 3
>>
>> Thanks!
>> Yarden
>>
>> On Thu, May 16, 2024 at 3:05 PM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Yarden,
>>>
>>> can you please provide all flink-related PipelineOptions you use for the
>>> job?
>>>
>>>   Jan
>>>
>>> On 5/16/24 13:44, Yarden BenMoshe wrote:
>>> > Hi all,
>>> > I have a project running with Beam 2.51, using Flink runner. In one of
>>> > my pipelines i have a FixedWindow and had a problem upgrading until
>>> > now, with a timers issue now resolved, and hopefully allowing me to
>>> > upgrade to version 2.56
>>> > However, I encounter another problem now which I believe is related to
>>> > watermarking(?).
>>> > My pipeline's source is a kafka topic.
>>> > My basic window definition is:
>>> >
>>> > PCollection>> windowCustomObjectInfo
>>> > = customObject.apply("windowCustomObjectInfo",
>>> >
>>> Window.into(FixedWindows.of(Duration.standardSeconds(60.apply(GroupByKey.create());
>>> >
>>> > and ever since upgrading to version 2.56 I am not getting any output
>>> > from that window. when enabling TRACE logs, i have this message:
>>> >
>>> > 2024-05-12 13:50:55,257 TRACE org.apache.beam.sdk.util.WindowTracing
>>> > [] - WatermarkHold.addHolds: element hold at 2024-05-12T13:50:59.999Z
>>> > is on time for key:test-12345;
>>> > window:[2024-05-12T13:50:00.000Z..2024-05-12T13:51:00.000Z);
>>> > inputWatermark:-290308-12-21T19:59:05.225Z;
>>> > outputWatermark:-290308-12-21T19:59:05.225Z
>>> >
>>> >
>>> > Any hints on where should I look or maybe how I can adjust my window
>>> > definition? Are you familiar with any change that might be the cause
>>> > for my issue?
>>> > Thanks
>>>
>>


Re: KafkaIO/FixedWindow changes 2.56?

2024-05-20 Thread Jan Lukavský

Does the description in [1] match your case?

[1] https://github.com/apache/beam/issues/31085#issuecomment-2115304242

On 5/19/24 10:07, Yarden BenMoshe wrote:

I am not running my pipeline from command-line, so used instead:
options.setExperiments(Collections.singletonList("use_deprecated_read"));

with ExperimentalOptions added to my options interface, however I dont 
think there's any effect to using it. in terms of the watermark, i 
received again:
WatermarkHold.addHolds: element hold at 2024-05-19T07:52:59.999Z is on 
time for key:aaa-bbb-ccc; 
window:[2024-05-19T07:52:00.000Z..2024-05-19T07:53:00.000Z); 
inputWatermark:-290308-12-21T19:59:05.225Z; 
outputWatermark:-290308-12-21T19:59:05.225Z




On Thu, May 16, 2024 at 5:06 PM Jan Lukavský <je...@seznam.cz> wrote:


Does using --experiments=use_deprecated_read have any effect?

On 5/16/24 14:30, Yarden BenMoshe wrote:

Hi Jan, my PipelineOptions is as follows:
options.setStreaming(true);
options.setAttachedMode(false);
options.setRunner(FlinkRunner.class);

I've also tried adding:
options.setAutoWatermarkInterval(100L);
as seen in some github issue, without any success so far.

other than that, i am working with parallelism:3 and number of
task slots: 3

Thanks!
Yarden

On Thu, May 16, 2024 at 3:05 PM Jan Lukavský <je...@seznam.cz> wrote:

Hi Yarden,

can you please provide all flink-related PipelineOptions you
use for the
job?

  Jan

On 5/16/24 13:44, Yarden BenMoshe wrote:
> Hi all,
> I have a project running with Beam 2.51, using Flink
runner. In one of
> my pipelines i have a FixedWindow and had a problem
upgrading until
> now, with a timers issue now resolved, and hopefully
allowing me to
> upgrade to version 2.56
> However, I encounter another problem now which I believe is
related to
> watermarking(?).
> My pipeline's source is a kafka topic.
> My basic window definition is:
>
> PCollection>>
windowCustomObjectInfo
> = customObject.apply("windowCustomObjectInfo",
>

Window.into(FixedWindows.of(Duration.standardSeconds(60.apply(GroupByKey.create());
>
> and ever since upgrading to version 2.56 I am not getting
any output
> from that window. when enabling TRACE logs, i have this
message:
>
> 2024-05-12 13:50:55,257 TRACE
org.apache.beam.sdk.util.WindowTracing
> [] - WatermarkHold.addHolds: element hold at
2024-05-12T13:50:59.999Z
> is on time for key:test-12345;
> window:[2024-05-12T13:50:00.000Z..2024-05-12T13:51:00.000Z);
> inputWatermark:-290308-12-21T19:59:05.225Z;
> outputWatermark:-290308-12-21T19:59:05.225Z
>
>
> Any hints on where should I look or maybe how I can adjust
my window
> definition? Are you familiar with any change that might be
the cause
> for my issue?
> Thanks


Re: KafkaIO/FixedWindow changes 2.56?

2024-05-19 Thread Yarden BenMoshe
I am not running my pipeline from command-line, so used instead:
options.setExperiments(Collections.singletonList("use_deprecated_read"));

with ExperimentalOptions added to my options interface, however I dont
think there's any effect to using it. in terms of the watermark, i received
again:
WatermarkHold.addHolds: element hold at 2024-05-19T07:52:59.999Z is on time
for key:aaa-bbb-ccc;
window:[2024-05-19T07:52:00.000Z..2024-05-19T07:53:00.000Z);
inputWatermark:-290308-12-21T19:59:05.225Z;
outputWatermark:-290308-12-21T19:59:05.225Z



On Thu, May 16, 2024 at 5:06 PM Jan Lukavský <je...@seznam.cz> wrote:

> Does using --experiments=use_deprecated_read have any effect?
> On 5/16/24 14:30, Yarden BenMoshe wrote:
>
> Hi Jan, my PipelineOptions is as follows:
> options.setStreaming(true);
> options.setAttachedMode(false);
> options.setRunner(FlinkRunner.class);
>
> I've also tried adding:
> options.setAutoWatermarkInterval(100L);
> as seen in some github issue, without any success so far.
>
> other than that, i am working with parallelism:3 and number of task
> slots: 3
>
> Thanks!
> Yarden
>
> On Thu, May 16, 2024 at 3:05 PM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi Yarden,
>>
>> can you please provide all flink-related PipelineOptions you use for the
>> job?
>>
>>   Jan
>>
>> On 5/16/24 13:44, Yarden BenMoshe wrote:
>> > Hi all,
>> > I have a project running with Beam 2.51, using Flink runner. In one of
>> > my pipelines i have a FixedWindow and had a problem upgrading until
>> > now, with a timers issue now resolved, and hopefully allowing me to
>> > upgrade to version 2.56
>> > However, I encounter another problem now which I believe is related to
>> > watermarking(?).
>> > My pipeline's source is a kafka topic.
>> > My basic window definition is:
>> >
>> > PCollection<KV<String, Iterable<CustomObject>>> windowCustomObjectInfo
>> > = customObject.apply("windowCustomObjectInfo",
>> > Window.into(FixedWindows.of(Duration.standardSeconds(60)))).apply(GroupByKey.create());
>> >
>> > and ever since upgrading to version 2.56 I am not getting any output
>> > from that window. when enabling TRACE logs, i have this message:
>> >
>> > 2024-05-12 13:50:55,257 TRACE org.apache.beam.sdk.util.WindowTracing
>> > [] - WatermarkHold.addHolds: element hold at 2024-05-12T13:50:59.999Z
>> > is on time for key:test-12345;
>> > window:[2024-05-12T13:50:00.000Z..2024-05-12T13:51:00.000Z);
>> > inputWatermark:-290308-12-21T19:59:05.225Z;
>> > outputWatermark:-290308-12-21T19:59:05.225Z
>> >
>> >
>> > Any hints on where should I look or maybe how I can adjust my window
>> > definition? Are you familiar with any change that might be the cause
>> > for my issue?
>> > Thanks
>>
>


Re: KafkaIO/FixedWindow changes 2.56?

2024-05-16 Thread Jan Lukavský

Does using --experiments=use_deprecated_read have any effect?

On 5/16/24 14:30, Yarden BenMoshe wrote:

Hi Jan, my PipelineOptions is as follows:
options.setStreaming(true);
options.setAttachedMode(false);
options.setRunner(FlinkRunner.class);

I've also tried adding:
options.setAutoWatermarkInterval(100L);
as seen in some github issue, without any success so far.

other than that, i am working with parallelism:3 and number of task 
slots: 3


Thanks!
Yarden

On Thu, May 16, 2024 at 15:05 Jan Lukavský <je...@seznam.cz> wrote:


Hi Yarden,

can you please provide all flink-related PipelineOptions you use
for the
job?

  Jan

On 5/16/24 13:44, Yarden BenMoshe wrote:
> Hi all,
> I have a project running with Beam 2.51, using Flink runner. In
one of
> my pipelines i have a FixedWindow and had a problem upgrading until
> now, with a timers issue now resolved, and hopefully allowing me to
> upgrade to version 2.56
> However, I encounter another problem now which I believe is
related to
> watermarking(?).
> My pipeline's source is a kafka topic.
> My basic window definition is:
>
> PCollection<KV<String, Iterable<CustomObject>>> windowCustomObjectInfo
> = customObject.apply("windowCustomObjectInfo",
> Window.into(FixedWindows.of(Duration.standardSeconds(60)))).apply(GroupByKey.create());
>
> and ever since upgrading to version 2.56 I am not getting any
output
> from that window. when enabling TRACE logs, i have this message:
>
> 2024-05-12 13:50:55,257 TRACE
org.apache.beam.sdk.util.WindowTracing
> [] - WatermarkHold.addHolds: element hold at
2024-05-12T13:50:59.999Z
> is on time for key:test-12345;
> window:[2024-05-12T13:50:00.000Z..2024-05-12T13:51:00.000Z);
> inputWatermark:-290308-12-21T19:59:05.225Z;
> outputWatermark:-290308-12-21T19:59:05.225Z
>
>
> Any hints on where should I look or maybe how I can adjust my
window
> definition? Are you familiar with any change that might be the
cause
> for my issue?
> Thanks


Re: KafkaIO/FixedWindow changes 2.56?

2024-05-16 Thread Yarden BenMoshe
Hi Jan, my PipelineOptions is as follows:
options.setStreaming(true);
options.setAttachedMode(false);
options.setRunner(FlinkRunner.class);

I've also tried adding:
options.setAutoWatermarkInterval(100L);
as seen in some github issue, without any success so far.

other than that, i am working with parallelism:3 and number of task slots: 3

Thanks!
Yarden

On Thu, May 16, 2024 at 15:05 Jan Lukavský <je...@seznam.cz> wrote:

> Hi Yarden,
>
> can you please provide all flink-related PipelineOptions you use for the
> job?
>
>   Jan
>
> On 5/16/24 13:44, Yarden BenMoshe wrote:
> > Hi all,
> > I have a project running with Beam 2.51, using Flink runner. In one of
> > my pipelines i have a FixedWindow and had a problem upgrading until
> > now, with a timers issue now resolved, and hopefully allowing me to
> > upgrade to version 2.56
> > However, I encounter another problem now which I believe is related to
> > watermarking(?).
> > My pipeline's source is a kafka topic.
> > My basic window definition is:
> >
> > PCollection<KV<String, Iterable<CustomObject>>> windowCustomObjectInfo
> > = customObject.apply("windowCustomObjectInfo",
> > Window.into(FixedWindows.of(Duration.standardSeconds(60)))).apply(GroupByKey.create());
> >
> > and ever since upgrading to version 2.56 I am not getting any output
> > from that window. when enabling TRACE logs, i have this message:
> >
> > 2024-05-12 13:50:55,257 TRACE org.apache.beam.sdk.util.WindowTracing
> > [] - WatermarkHold.addHolds: element hold at 2024-05-12T13:50:59.999Z
> > is on time for key:test-12345;
> > window:[2024-05-12T13:50:00.000Z..2024-05-12T13:51:00.000Z);
> > inputWatermark:-290308-12-21T19:59:05.225Z;
> > outputWatermark:-290308-12-21T19:59:05.225Z
> >
> >
> > Any hints on where should I look or maybe how I can adjust my window
> > definition? Are you familiar with any change that might be the cause
> > for my issue?
> > Thanks
>


Re: KafkaIO/FixedWindow changes 2.56?

2024-05-16 Thread Jan Lukavský

Hi Yarden,

can you please provide all flink-related PipelineOptions you use for the 
job?


 Jan

On 5/16/24 13:44, Yarden BenMoshe wrote:

Hi all,
I have a project running with Beam 2.51, using Flink runner. In one of 
my pipelines i have a FixedWindow and had a problem upgrading until 
now, with a timers issue now resolved, and hopefully allowing me to 
upgrade to version 2.56
However, I encounter another problem now which I believe is related to 
watermarking(?).

My pipeline's source is a kafka topic.
My basic window definition is:

PCollection<KV<String, Iterable<CustomObject>>> windowCustomObjectInfo
= customObject.apply("windowCustomObjectInfo",
Window.into(FixedWindows.of(Duration.standardSeconds(60)))).apply(GroupByKey.create());


and ever since upgrading to version 2.56 I am not getting any output 
from that window. when enabling TRACE logs, i have this message:


2024-05-12 13:50:55,257 TRACE org.apache.beam.sdk.util.WindowTracing 
[] - WatermarkHold.addHolds: element hold at 2024-05-12T13:50:59.999Z 
is on time for key:test-12345; 
window:[2024-05-12T13:50:00.000Z..2024-05-12T13:51:00.000Z); 
inputWatermark:-290308-12-21T19:59:05.225Z; 
outputWatermark:-290308-12-21T19:59:05.225Z



Any hints on where should I look or maybe how I can adjust my window 
definition? Are you familiar with any change that might be the cause 
for my issue?

Thanks


Re: Fails to deploy a python pipeline to a flink cluster

2024-05-11 Thread Jaehyeon Kim
Hi XQ

I haven't changed anything and the issue persists on my end. The print
statements are called only when self.verbose is True and, by default, it is False.

BTW, do you have any idea about the error message? I haven't seen such an error.

Cheers,
Jaehyeon

On Sun, 12 May 2024, 12:15 am XQ Hu via user,  wrote:

> Do you still have the same issue? I tried to follow your setup.sh to
> reproduce this but somehow I am stuck at the word_len step. I saw you also
> tried to use `print(kafka_kv)` to debug it. I am not sure about your
> current status.
>
> On Fri, May 10, 2024 at 9:18 AM Jaehyeon Kim  wrote:
>
>> Hello,
>>
>> I'm playing with deploying a python pipeline to a flink cluster on
>> kubernetes via flink kubernetes operator. The pipeline simply calculates
>> average word lengths in a fixed time window of 5 seconds and it works with
>> the embedded flink cluster.
>>
>> First, I created a k8s cluster (v1.25.3) on minikube and a docker image
>> named beam-python-example:1.17 created using the following docker file -
>> the full details can be checked in
>> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-deploy/beam-deploy/beam/Dockerfile
>>
>> The java sdk is used for the sdk harness of the kafka io's expansion
>> service while the job server is used to execute the python pipeline in the
>> flink operator.
>>
>> FROM flink:1.17
>> ...
>> ## add java SDK and job server
>> COPY --from=apache/beam_java8_sdk:2.56.0 /opt/apache/beam/
>> /opt/apache/beam/
>>
>> COPY --from=apache/beam_flink1.17_job_server:2.56.0  \
>>   /opt/apache/beam/jars/beam-runners-flink-job-server.jar
>> /opt/apache/beam/jars/beam-runners-flink-job-server.jar
>>
>> RUN chown -R flink:flink /opt/apache/beam
>>
>> ## install python 3.10.13
>> RUN apt-get update -y && \
>>   apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev
>> libffi-dev liblzma-dev && \
>>   wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-
>> ${PYTHON_VERSION}.tgz && \
>> ...
>> ## install apache beam 2.56.0
>> RUN pip3 install apache-beam==${BEAM_VERSION}
>>
>> ## copy pipeline source
>> RUN mkdir /opt/flink/app
>> COPY word_len.py /opt/flink/app/
>>
>> Then the pipeline is deployed using the following manifest - the full
>> details can be found in
>> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-deploy/beam-deploy/beam/word_len.yml
>>
>> apiVersion: flink.apache.org/v1beta1
>> kind: FlinkDeployment
>> metadata:
>>   name: beam-word-len
>> spec:
>>   image: beam-python-example:1.17
>>   imagePullPolicy: Never
>>   flinkVersion: v1_17
>>   flinkConfiguration:
>> taskmanager.numberOfTaskSlots: "5"
>>   serviceAccount: flink
>>   podTemplate:
>> spec:
>>   containers:
>> - name: flink-main-container
>>   env:
>> - name: BOOTSTRAP_SERVERS
>>   value: demo-cluster-kafka-bootstrap:9092
>> ...
>>   jobManager:
>> resource:
>>   memory: "2048m"
>>   cpu: 1
>>   taskManager:
>> replicas: 2
>> resource:
>>   memory: "2048m"
>>   cpu: 1
>> podTemplate:
>>   spec:
>> containers:
>>   - name: python-worker-harness
>> image: apache/beam_python3.10_sdk:2.56.0
>> imagePullPolicy: Never
>> args: ["--worker_pool"]
>> ports:
>>   - containerPort: 5
>>
>>   job:
>> jarURI:
>> local:///opt/apache/beam/jars/beam-runners-flink-job-server.jar
>> entryClass:
>> org.apache.beam.runners.flink.FlinkPortableClientEntryPoint
>> args:
>>   - "--driver-cmd"
>>   - "python /opt/flink/app/word_len.py --deploy"
>> parallelism: 3
>> upgradeMode: stateless
>>
>> Here is the pipeline source - the full details can be found in
>> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-deploy/beam-deploy/beam/word_len.py
>>
>> When I add the --deploy flag, the python sdk harness is set to EXTERNAL
>> and its config is set to localhost:5 - I believe it'll point to the
>> side car container of the task manager. For the kafka io, the expansion
>> service's sdk harness is configured as PROCESS and the command points to
>> the java sdk that is added in the beam-python-example:1.17 image.
>>
>> ...
>> def run(args=None):
>> parser = argparse.ArgumentParser(description="Beam pipeline
>> arguments")
>> parser.add_argument("--runner", default="FlinkRunner", help="Apache
>> Beam runner")
>> parser.add_argument(
>> "--deploy",
>> action="store_true",
>> default="Flag to indicate whether to use an own local cluster",
>> )
>> opts, _ = parser.parse_known_args(args)
>>
>> pipeline_opts = {
>> "runner": opts.runner,
>> "job_name": "avg-word-length-beam",
>> "streaming": True,
>> "environment_type": "EXTERNAL" if opts.deploy is True else
>> "LOOPBACK",
>> "checkpointing_interval": "6",
>> }
>>
>> expansion_service = None
>> if pipeline_opts["environment_type"] == "EX

Re: Fails to deploy a python pipeline to a flink cluster

2024-05-11 Thread XQ Hu via user
Do you still have the same issue? I tried to follow your setup.sh to
reproduce this but somehow I am stuck at the word_len step. I saw you also
tried to use `print(kafka_kv)` to debug it. I am not sure about your
current status.

On Fri, May 10, 2024 at 9:18 AM Jaehyeon Kim  wrote:

> Hello,
>
> I'm playing with deploying a python pipeline to a flink cluster on
> kubernetes via flink kubernetes operator. The pipeline simply calculates
> average word lengths in a fixed time window of 5 seconds and it works with
> the embedded flink cluster.
>
> First, I created a k8s cluster (v1.25.3) on minikube and a docker image
> named beam-python-example:1.17 created using the following docker file -
> the full details can be checked in
> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-deploy/beam-deploy/beam/Dockerfile
>
> The java sdk is used for the sdk harness of the kafka io's expansion
> service while the job server is used to execute the python pipeline in the
> flink operator.
>
> FROM flink:1.17
> ...
> ## add java SDK and job server
> COPY --from=apache/beam_java8_sdk:2.56.0 /opt/apache/beam/
> /opt/apache/beam/
>
> COPY --from=apache/beam_flink1.17_job_server:2.56.0  \
>   /opt/apache/beam/jars/beam-runners-flink-job-server.jar
> /opt/apache/beam/jars/beam-runners-flink-job-server.jar
>
> RUN chown -R flink:flink /opt/apache/beam
>
> ## install python 3.10.13
> RUN apt-get update -y && \
>   apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev
> libffi-dev liblzma-dev && \
>   wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-
> ${PYTHON_VERSION}.tgz && \
> ...
> ## install apache beam 2.56.0
> RUN pip3 install apache-beam==${BEAM_VERSION}
>
> ## copy pipeline source
> RUN mkdir /opt/flink/app
> COPY word_len.py /opt/flink/app/
>
> Then the pipeline is deployed using the following manifest - the full
> details can be found in
> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-deploy/beam-deploy/beam/word_len.yml
>
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: beam-word-len
> spec:
>   image: beam-python-example:1.17
>   imagePullPolicy: Never
>   flinkVersion: v1_17
>   flinkConfiguration:
> taskmanager.numberOfTaskSlots: "5"
>   serviceAccount: flink
>   podTemplate:
> spec:
>   containers:
> - name: flink-main-container
>   env:
> - name: BOOTSTRAP_SERVERS
>   value: demo-cluster-kafka-bootstrap:9092
> ...
>   jobManager:
> resource:
>   memory: "2048m"
>   cpu: 1
>   taskManager:
> replicas: 2
> resource:
>   memory: "2048m"
>   cpu: 1
> podTemplate:
>   spec:
> containers:
>   - name: python-worker-harness
> image: apache/beam_python3.10_sdk:2.56.0
> imagePullPolicy: Never
> args: ["--worker_pool"]
> ports:
>   - containerPort: 5
>
>   job:
> jarURI:
> local:///opt/apache/beam/jars/beam-runners-flink-job-server.jar
> entryClass:
> org.apache.beam.runners.flink.FlinkPortableClientEntryPoint
> args:
>   - "--driver-cmd"
>   - "python /opt/flink/app/word_len.py --deploy"
> parallelism: 3
> upgradeMode: stateless
>
> Here is the pipeline source - the full details can be found in
> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-deploy/beam-deploy/beam/word_len.py
>
> When I add the --deploy flag, the python sdk harness is set to EXTERNAL
> and its config is set to localhost:5 - I believe it'll point to the
> side car container of the task manager. For the kafka io, the expansion
> service's sdk harness is configured as PROCESS and the command points to
> the java sdk that is added in the beam-python-example:1.17 image.
>
> ...
> def run(args=None):
> parser = argparse.ArgumentParser(description="Beam pipeline arguments"
> )
> parser.add_argument("--runner", default="FlinkRunner", help="Apache
> Beam runner")
> parser.add_argument(
> "--deploy",
> action="store_true",
> default="Flag to indicate whether to use an own local cluster",
> )
> opts, _ = parser.parse_known_args(args)
>
> pipeline_opts = {
> "runner": opts.runner,
> "job_name": "avg-word-length-beam",
> "streaming": True,
> "environment_type": "EXTERNAL" if opts.deploy is True else
> "LOOPBACK",
> "checkpointing_interval": "6",
> }
>
> expansion_service = None
> if pipeline_opts["environment_type"] == "EXTERNAL":
> pipeline_opts = {
> **pipeline_opts,
> **{
> "environment_config": "localhost:5",
> "flink_submit_uber_jar": True,
> },
> }
> expansion_service = kafka.default_io_expansion_service(
> append_args=[
> "--defaultEnvironmentType=PROCESS",
>
> '--defaultEnvironmentConfig={"command":"/opt/apache/beam/boot"}',
> 

Re: Query about `JdbcIO.PoolableDataSourceProvider`

2024-05-08 Thread Yi Hu
Hi Vardhan,

I checked the source code and history of PoolableDataSourceProvider; here are my
findings:

- PoolableDataSourceProvider is already a static singleton [1], which means it
is one DataSource for each DataSourceConfiguration, per worker. More
specifically, multiple threads within a worker should share a connection if
they connect to the same database. PoolableDataSourceProvider should also support
connecting to different databases, because the underlying singleton Map is
keyed by DataSourceConfiguration.

- However, I notice there is another open issue [2] claiming "the current
implementation default parameters cannot cover all cases". I am wondering if
this is the case and leads to the "overwhelm the source db" behavior you observe?

[1] https://github.com/apache/beam/pull/8635

[2] https://github.com/apache/beam/issues/19393

In any case, one can define their own withDataSourceProviderFn (as mentioned by 
[2]) that implements a custom connection pool.
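
As an illustration only (not a verified fix), here is a rough sketch of a
withDataSourceProviderFn that keeps one pooled DataSource per JDBC URL per worker JVM,
using HikariCP as the pool; the connection details and pool settings are placeholders:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.sql.DataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import org.apache.beam.sdk.transforms.SerializableFunction;

/** Sketch: one Hikari pool per JDBC URL, shared by all threads on a worker. */
class PooledDataSourceProvider implements SerializableFunction<Void, DataSource> {
  // Static map so all threads in the worker JVM reuse the same pool for a given URL.
  private static final ConcurrentMap<String, DataSource> POOLS = new ConcurrentHashMap<>();

  private final String jdbcUrl;   // e.g. "jdbc:postgresql://host/db" (placeholder)
  private final String user;
  private final String password;

  PooledDataSourceProvider(String jdbcUrl, String user, String password) {
    this.jdbcUrl = jdbcUrl;
    this.user = user;
    this.password = password;
  }

  @Override
  public DataSource apply(Void input) {
    // computeIfAbsent guarantees a single pool per URL even under concurrent calls.
    return POOLS.computeIfAbsent(jdbcUrl, url -> {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl(url);
      config.setUsername(user);
      config.setPassword(password);
      config.setMaximumPoolSize(10);  // tune pool parameters to what the source db can handle
      return new HikariDataSource(config);
    });
  }
}

// Usage (sketch):
//   JdbcIO.<MyRecord>read()
//       .withDataSourceProviderFn(new PooledDataSourceProvider(jdbcUrl, user, password))
//       ...

Because the map is keyed by URL, this also covers the case of a pipeline reading from
multiple source databases: each database gets its own pool, once per worker.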

Best,
Yi

On 2024/05/04 12:18:47 Vardhan Thigle via user wrote:
> Hi Beam Experts,
> 
> I had a small query about `JdbcIO.PoolableDataSourceProvider`
> 
> As per the main documentation of JdbcIO,
> (IIUC) `JdbcIO.PoolableDataSourceProvider` creates one DataSource per
> execution thread by default, which can overwhelm the source db.
> 
> Whereas, as per the Java doc of JdbcIO.PoolableDataSourceProvider:
> 
> At most a single DataSource instance will be constructed during pipeline
> execution for each unique JdbcIO.DataSourceConfiguration within the pipeline.
> 
> If I want a singleton poolable connection for a given source database and
> my pipeline is dealing with multiple source databases, do I need to wrap
> the `JdbcIO.PoolableDataSourceProvider` in another concurrent hash map
> (from the implementation it looks like that's what it does already and it's
> not needed)? I am a bit confused due to the variation in the 2 docs above
> (it's quite possible that I am interpreting them wrong).
> Would it be more recommended to roll out a custom class as suggested in the
> main documentation of JdbcIO, in cases like: 1. configure the pool config,
> 2. use an alternative source like, say, Hikari, which if I understand
> correctly is not possible with JdbcIO.PoolableDataSourceProvider.
> 
> 
> 
> 
> Regards and Thanks,
> Vardhan Thigle,
> +919535346204
> 


Re: Pipeline gets stuck when chaining two SDFs (Python SDK)

2024-05-05 Thread XQ Hu via user
I added this issue here
https://github.com/apache/beam/issues/24528#issuecomment-2095026324
But we do not plan to fix this for Python DirectRunner since Prism will
become the default local runner when it is ready.

On Sun, May 5, 2024 at 2:41 PM Jaehyeon Kim  wrote:

> Hi XQ
>
> Yes, it works with the FlinkRunner. Thank you so much!
>
> Cheers,
> Jaehyeon
>
> [image: image.png]
>
> On Mon, 6 May 2024 at 02:49, XQ Hu via user  wrote:
>
>> Have you tried to use other runners? I think this might be caused by some
>> gaps in Python DirectRunner to support the streaming cases or SDFs,
>>
>> On Sun, May 5, 2024 at 5:10 AM Jaehyeon Kim  wrote:
>>
>>> Hi XQ
>>>
>>> Thanks for checking it out. SDFs chaining seems to work as I created my
>>> pipeline while converting a pipeline that is built in the Java SDK. The
>>> source of the Java pipeline can be found in
>>> https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam/blob/main/chapter7/src/main/java/com/packtpub/beam/chapter7/StreamingFileRead.java
>>>
>>> So far, when I yield outputs, the second SDF gets stuck while it gets
>>> executed if I return them (but the first SDF completes). If I change the
>>> second SDF into a do function without adding the tracker, it is executed
>>> fine. Not sure what happens in the first scenario.
>>>
>>> Cheers,
>>> Jaehyeon
>>>
>>> On Sun, 5 May 2024 at 09:21, XQ Hu via user 
>>> wrote:
>>>
 I played with your example. Indeed, create_tracker in
 your ProcessFilesFn is never called, which is quite strange.
 I could not find any example that shows the chained SDFs, which makes
 me wonder whether the chained SDFs work.

 @Chamikara Jayalath  Any thoughts?

 On Fri, May 3, 2024 at 2:45 AM Jaehyeon Kim  wrote:

> Hello,
>
> I am building a pipeline using two SDFs that are chained. The first
> function (DirectoryWatchFn) checks a folder continuously and grabs if a 
> new
> file is added. The second one (ProcessFilesFn) processes a file
> while splitting each line - the processing simply prints the file name and
> line number.
>
> The process function of the first SDF gets stuck if I yield a new file
> object. Specifically, although the second SDF is called as I can check the
> initial restriction is created, the tracker is not created at all!
>
> On the other hand, if I return the file object list, the second SDF
> works fine but the issue is the first SDF stops as soon as it returns the
> first list of files.
>
> The source of the pipeline can be found in
> - First SDF:
> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-pipeline/beam-pipelines/chapter7/directory_watch.py
> - Second SDF:
> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-pipeline/beam-pipelines/chapter7/file_read.py
> - Pipeline:
> https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-pipeline/beam-pipelines/chapter7/streaming_file_read.py
>
> Can you please inform me how to handle this issue?
>
> Cheers,
> Jaehyeon
>
> class DirectoryWatchFn(beam.DoFn):
> POLL_TIMEOUT = 10
>
> @beam.DoFn.unbounded_per_element()
> def process(
> self,
> element: str,
> tracker: RestrictionTrackerView = beam.DoFn.RestrictionParam(
> DirectoryWatchRestrictionProvider()
> ),
> watermark_estimater: WatermarkEstimatorProvider = beam.DoFn.
> WatermarkEstimatorParam(
> DirectoryWatchWatermarkEstimatorProvider()
> ),
> ) -> typing.Iterable[MyFile]:
> new_files = self._get_new_files_if_any(element, tracker)
> if self._process_new_files(tracker, watermark_estimater,
> new_files):
> # return [new_file[0] for new_file in new_files] #<-- it
> doesn't get stuck but the SDF finishes
> for new_file in new_files: #<--- it gets stuck if
> yielding file objects
> yield new_file[0]
> else:
> return
> tracker.defer_remainder(Duration.of(self.POLL_TIMEOUT))
>
> def _get_new_files_if_any(
> self, element: str, tracker: DirectoryWatchRestrictionTracker
> ) -> typing.List[typing.Tuple[MyFile, Timestamp]]:
> new_files = []
> for file in os.listdir(element):
> if (
> os.path.isfile(os.path.join(element, file))
> and file not in tracker.current_restriction().
> already_processed
> ):
> num_lines = sum(1 for _ in open(os.path.join(element,
> file)))
> new_file = MyFile(file, 0, num_lines)
> print(new_file)
> new_files.append(
> (
> new_file,
> Tim

Re: Pipeline gets stuck when chaining two SDFs (Python SDK)

2024-05-05 Thread Jaehyeon Kim
Hi XQ

Yes, it works with the FlinkRunner. Thank you so much!

Cheers,
Jaehyeon

[image: image.png]

On Mon, 6 May 2024 at 02:49, XQ Hu via user  wrote:

> Have you tried to use other runners? I think this might be caused by some
> gaps in Python DirectRunner to support the streaming cases or SDFs,
>
> On Sun, May 5, 2024 at 5:10 AM Jaehyeon Kim  wrote:
>
>> Hi XQ
>>
>> Thanks for checking it out. SDFs chaining seems to work as I created my
>> pipeline while converting a pipeline that is built in the Java SDK. The
>> source of the Java pipeline can be found in
>> https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam/blob/main/chapter7/src/main/java/com/packtpub/beam/chapter7/StreamingFileRead.java
>>
>> So far, when I yield outputs, the second SDF gets stuck while it gets
>> executed if I return them (but the first SDF completes). If I change the
>> second SDF into a do function without adding the tracker, it is executed
>> fine. Not sure what happens in the first scenario.
>>
>> Cheers,
>> Jaehyeon
>>
>> On Sun, 5 May 2024 at 09:21, XQ Hu via user  wrote:
>>
>>> I played with your example. Indeed, create_tracker in
>>> your ProcessFilesFn is never called, which is quite strange.
>>> I could not find any example that shows the chained SDFs, which makes me
>>> wonder whether the chained SDFs work.
>>>
>>> @Chamikara Jayalath  Any thoughts?
>>>
>>> On Fri, May 3, 2024 at 2:45 AM Jaehyeon Kim  wrote:
>>>
 Hello,

 I am building a pipeline using two SDFs that are chained. The first
 function (DirectoryWatchFn) checks a folder continuously and grabs if a new
 file is added. The second one (ProcessFilesFn) processes a file
 while splitting each line - the processing simply prints the file name and
 line number.

 The process function of the first SDF gets stuck if I yield a new file
 object. Specifically, although the second SDF is called as I can check the
 initial restriction is created, the tracker is not created at all!

 On the other hand, if I return the file object list, the second SDF
 works fine but the issue is the first SDF stops as soon as it returns the
 first list of files.

 The source of the pipeline can be found in
 - First SDF:
 https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-pipeline/beam-pipelines/chapter7/directory_watch.py
 - Second SDF:
 https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-pipeline/beam-pipelines/chapter7/file_read.py
 - Pipeline:
 https://github.com/jaehyeon-kim/beam-demos/blob/feature/beam-pipeline/beam-pipelines/chapter7/streaming_file_read.py

 Can you please inform me how to handle this issue?

 Cheers,
 Jaehyeon

 class DirectoryWatchFn(beam.DoFn):
 POLL_TIMEOUT = 10

 @beam.DoFn.unbounded_per_element()
 def process(
 self,
 element: str,
 tracker: RestrictionTrackerView = beam.DoFn.RestrictionParam(
 DirectoryWatchRestrictionProvider()
 ),
 watermark_estimater: WatermarkEstimatorProvider = beam.DoFn.
 WatermarkEstimatorParam(
 DirectoryWatchWatermarkEstimatorProvider()
 ),
 ) -> typing.Iterable[MyFile]:
 new_files = self._get_new_files_if_any(element, tracker)
 if self._process_new_files(tracker, watermark_estimater,
 new_files):
 # return [new_file[0] for new_file in new_files] #<-- it
 doesn't get stuck but the SDF finishes
 for new_file in new_files: #<--- it gets stuck if yielding
 file objects
 yield new_file[0]
 else:
 return
 tracker.defer_remainder(Duration.of(self.POLL_TIMEOUT))

 def _get_new_files_if_any(
 self, element: str, tracker: DirectoryWatchRestrictionTracker
 ) -> typing.List[typing.Tuple[MyFile, Timestamp]]:
 new_files = []
 for file in os.listdir(element):
 if (
 os.path.isfile(os.path.join(element, file))
 and file not in tracker.current_restriction().
 already_processed
 ):
 num_lines = sum(1 for _ in open(os.path.join(element,
 file)))
 new_file = MyFile(file, 0, num_lines)
 print(new_file)
 new_files.append(
 (
 new_file,
 Timestamp.of(os.path.getmtime(os.path.join(
 element, file))),
 )
 )
 return new_files

 def _process_new_files(
 self,
 tracker: DirectoryWatchRestrictionTracker,
 watermark_estimater: ManualWatermarkEstimator,
 new_files: typing.List[typing.Tuple[MyFile, Timestamp]],
 ):
>>
