Thanks, Ankur, for the feedback.
Does anyone have any examples they could share of writing an SDF for an
unbounded source (ideally in Python)?
The SDF I wrote looks something like this. It works fine using the
DirectRunner with the pipeline option --direct_running_mode='in_memory',
but doesn’t seem to
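For anyone searching the archives, here is a minimal sketch of an
unbounded SDF in Python, assuming a hypothetical poll_source(offset)
helper that returns a (possibly empty) batch of events; the names, the
offset scheme, and the resume delay are illustrative, and watermark
estimation is omitted:

import apache_beam as beam
from apache_beam.io.restriction_trackers import (
    OffsetRange, OffsetRestrictionTracker)
from apache_beam.transforms.core import RestrictionProvider
from apache_beam.utils import timestamp

class _UnboundedOffsetProvider(RestrictionProvider):
    # An effectively infinite offset range stands in for "unbounded".
    def initial_restriction(self, element):
        return OffsetRange(0, 2**63 - 1)

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()

class ReadUnboundedFn(beam.DoFn):
    def process(
        self,
        element,
        tracker=beam.DoFn.RestrictionParam(_UnboundedOffsetProvider())):
        offset = tracker.current_restriction().start
        while True:
            batch = poll_source(offset)  # hypothetical source read
            if not batch:
                # Nothing available yet: checkpoint and resume later.
                tracker.defer_remainder(timestamp.Duration(seconds=10))
                return
            for event in batch:
                if not tracker.try_claim(offset):
                    return  # the restriction was split or checkpointed
                yield event
                offset += 1

# Usage: p | beam.Impulse() | beam.ParDo(ReadUnboundedFn())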
Great idea, Nick! It could definitely work to transform the data within
Elasticsearch first and then effectively run a "match_all" against the
transformed index within Beam. Agreed that it's not ideal from a
user-experience perspective, but it hopefully serves as another potential
path to unblocking
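To make that workaround concrete: a rough sketch of the "read the
pre-aggregated index back" step, using the Elasticsearch 8.x Python
client with a hypothetical endpoint and index name (in a pipeline, the
same match_all query would be handed to the IO connector instead):

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# The heavy lifting happened inside Elasticsearch; the read itself is a
# plain match_all with no aggregations.
response = es.search(
    index="my-transformed-index",  # hypothetical destination index
    query={"match_all": {}},
    size=1000,
)
for hit in response["hits"]["hits"]:
    print(hit["_source"])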
Hi Evan,
Thanks for the info. Since I need to support many aggregation types,
implementing all the group-by logic in a PTransform can be a lot of work.
But I will look into the second approach.
Another approach I am considering is to directly kick off a single batch
transform to
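For context on the first option: each aggregation type would map to a
CombineFn (or a stock combiner) applied per key. A minimal sketch of one
such aggregation, an average, with illustrative names only:

import apache_beam as beam

class MeanFn(beam.CombineFn):
    # Accumulator is a (running_sum, count) pair.
    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('nan')

# Usage: keyed_values | beam.CombinePerKey(MeanFn())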
Hello! I’m looking for some help writing a custom transform that reads
from an unbounded source. A basic version would look something like this:
import apache_beam as beam
import myeventlibrary

class _ReadEventsFn(beam.DoFn):
    def process(self, unused_element):
        # Hypothetical client API; blocks while pulling the event stream.
        subscriber = myeventlibrary.Subscriber()
        for event in subscriber.pull():
            yield event
As long as we can use AvroIO to save files to "s3://xx" we are fine.
Looks like the JIRA has been around for a while. What is the procedure to
contribute from our end?
Thanks
On Wed, Mar 9, 2022 at 9:59 AM Alexey Romanenko wrote:
> Hi Yushu,
>
> I’m not sure that we have a workaround for that
Hi Yushu,
I’m not sure that we have a workaround for that since the related jira issue
[1] is still open.
Side question: are you interested only in multipart version or both?
—
Alexey
[1] https://issues.apache.org/jira/browse/BEAM-10850
> On 9 Mar 2022, at 00:19, Yushu Yao wrote:
>
> Hi
There are two "kinds" of splits in an SDF - one splits the restriction
*before* it is processed and the other *during* processing. The first
one is supported (it is needed for correctness) and the other is, in the
bounded case, only an optimization (which is not currently supported). It
seems to me,
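To illustrate the first kind: a RestrictionProvider can override split()
so the runner distributes sub-restrictions before processing starts. A
rough sketch, assuming elements that carry a size attribute and an
arbitrary chunk size:

from apache_beam.io.restriction_trackers import (
    OffsetRange, OffsetRestrictionTracker)
from apache_beam.transforms.core import RestrictionProvider

class ChunkedProvider(RestrictionProvider):
    def initial_restriction(self, element):
        return OffsetRange(0, element.size)  # hypothetical element.size

    def split(self, element, restriction):
        # Pre-processing split: hand out fixed-size sub-restrictions.
        yield from restriction.split(desired_num_offsets_per_split=1024)

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)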
Thanks for the response! That's what I feared was going on.
I consider this a huge shortcoming, particularly because it does not
affect only users with large files, as you said. The same happens with
many small files, because file globs are also fused to one worker. The
only way to process
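One fusion-breaking workaround that is sometimes suggested (a sketch
only, not verified on the FlinkRunner): expand the glob yourself and
reshuffle the matched files so the reads can land on different workers.
The glob below is a placeholder:

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    lines = (
        p
        | fileio.MatchFiles('gs://my-bucket/logs/*.txt')  # placeholder
        | beam.Reshuffle()  # break fusion between matching and reading
        | fileio.ReadMatches()
        | beam.FlatMap(lambda f: f.read_utf8().splitlines())
    )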
Hi Janek,
I think you hit a deficiency in the FlinkRunner's SDF implementation.
AFAIK the runner is unable to do dynamic splitting, which is what you
are probably looking for. What you describe essentially works in the
model, but FlinkRunner does not implement the complete contract to make
The Apache Beam team is pleased to announce the release of version 2.37.0.
Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org
You can download the release
I went through all Flink and Beam documentation I could find to see if I
overlooked something, but I could not get the text input source to
unfuse the file input splits. This creates a huge input bottleneck,
because one worker is busy reading records from a huge input file while
99 others wait
You're correct in your assessment that ElasticsearchIO does not currently
support queries with aggregations. There's a large difference between
scrolling over large sets of documents (which has a common interface
provided by ES) vs. aggregations, where user code in the query will impact
the output
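As a concrete illustration of the difference: the response to an
aggregations query has a user-defined shape, unlike the uniform hits that
scrolling returns. Field and bucket names below are hypothetical:

# Scrolling yields documents through one common "hits" interface; an
# aggregation like this instead returns an "aggregations" tree whose
# structure is dictated by the query itself.
aggregation_query = {
    'size': 0,
    'aggs': {
        'events_per_user': {
            'terms': {'field': 'user_id'},
            'aggs': {'avg_latency': {'avg': {'field': 'latency_ms'}}},
        },
    },
}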
Hello,
I have a use case where I need to first compute an aggregation for each
key and then filter out the keys based on some criteria, and finally
feed the matched keys as input to a PCollection using an ElasticsearchIO
read. But ElasticsearchIO does not seem to support a query that contains
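In the meantime, the first two steps can be expressed in Beam itself. A
rough sketch with an illustrative aggregation (a per-key sum) and a
placeholder threshold:

import apache_beam as beam

def matched_keys(keyed_events, threshold=100):
    # keyed_events: a PCollection of (key, value) pairs already in the
    # pipeline; the threshold stands in for whatever the criteria are.
    return (
        keyed_events
        | 'AggregatePerKey' >> beam.CombinePerKey(sum)
        | 'FilterKeys' >> beam.Filter(lambda kv: kv[1] > threshold)
        | 'Keys' >> beam.Keys()
    )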