Re: Python SDF for unbound source

2022-03-09 Thread Sam Bourne
Thanks Ankur for the feedback. Does anyone have any examples they could share writing a SDF for an unbound source (ideally in python)? The SDF I wrote looks something like this. It works fine using the DirectRunner with the PipelineOption --direct_runner_mode='in_memory', but doesn’t seem to

Re: Running Query Containing Aggregation Using ElasticsearchIO Read

2022-03-09 Thread Evan Galpin
Great idea Nick, it could definitely work to transform the data within Elasticsearch first and then effectively run a "match_all" against the transformed index within Beam. Agreed that it's not ideal from a user-experience perspective but hopefully serves as another potential path to unblocking

Re: Running Query Containing Aggregation Using ElasticsearchIO Read

2022-03-09 Thread Nick Pan
Hi Evan, Thanks for the info. Since I need to support many aggregation types, implementing all the group-by logic in PTransform can be a lot of work. But I will look into the 2nd approach. Another approach I am considering is to directly kick off a single batch transform to

Python SDF for unbound source

2022-03-09 Thread Sam Bourne
Hello! I’m looking for some help writing a custom transform that reads from an unbound source. A basic version of this would look something like this: import apache_beam as beam import myeventlibrary class _ReadEventsFn(beam.DoFn): def process(self, unused_element): subscriber =

Re: Write S3 File with CannedACL

2022-03-09 Thread Yushu Yao
As long as we can use AvroIO to save files to "s3://xx" we are fine. Looks like the JIRA has been around for a while. What is the procedure to contribute from our end? Thanks On Wed, Mar 9, 2022 at 9:59 AM Alexey Romanenko wrote: > Hi Yushu, > > I’m not sure that we have a workaround for that

Re: Write S3 File with CannedACL

2022-03-09 Thread Alexey Romanenko
Hi Yushu, I’m not sure that we have a workaround for that since the related jira issue [1] is still open. Side question: are you interested only in multipart version or both? — Alexey [1] https://issues.apache.org/jira/browse/BEAM-10850 > On 9 Mar 2022, at 00:19, Yushu Yao wrote: > > Hi

Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Jan Lukavský
There are two "kinds" of splits in SDF - one splits the restriction *before* being processed and the other *during* processing. The first one is supported (it is needed for correctness) and the other is in bounded case only an optimization (which is not currently supported). It seems to me,

Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Janek Bevendorff
Thanks for the response! That's what I feared was going on. I consider this a huge shortcoming, particularly because it does not only affect users with large files like you said. The same happens with many small files, because file globs are also fused to one worker. The only way to process

Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Jan Lukavský
Hi Janek, I think you hit a deficiency in the FlinkRunner's SDF implementation. AFAIK the runner is unable to do dynamic splitting, which is what you are probably looking for. What you describe essentially works in the model, but FlinkRunner does not implement the complete contract to make

[ANNOUNCE] Apache Beam 2.37.0 Released

2022-03-09 Thread Brian Hulette
The Apache Beam team is pleased to announce the release of version 2.37.0. Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org You can download the release

Re: Beam on Flink not processing input splits in parallel

2022-03-09 Thread Janek Bevendorff
I went through all Flink and Beam documentation I could find to see if I overlooked something, but I could not get the text input source to unfuse the file input splits. This creates a huge input bottleneck, because one worker is busy reading records from a huge input file while 99 others wait

Re: Running Query Containing Aggregation Using ElasticsearchIO Read

2022-03-09 Thread Evan Galpin
You're correct in your assessment that ElasticsearchIO does not currently support queries with aggregations. There's a large difference between scrolling over large sets of documents (which has a common interface provided by ES) Vs aggregations where user-code in the query will impact the output

Running Query Containing Aggregation Using ElasticsearchIO Read

2022-03-09 Thread Nick Pan
Hello, I have a use case where I need to first compute an aggregation for each key, and then filter out the keys based on some criteria. And finally feed the matched keys as an input to PCollection using ElasticsearchIO read. But ElasticsearchIO does not seem to support query that contains