Maybe the solution implemented in JdbcIO [1], [2] could be helpful in these
cases.
[1] https://issues.apache.org/jira/browse/BEAM-2803
[2]
https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1088-L1118
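For anyone curious, the core idea behind that JdbcIO fix (and behind Beam's Reshuffle in general) is to pair each element with a random shard key, group by that key, and then flatten the groups again; the grouping step is a shuffle boundary, which breaks fusion so downstream work can spread across workers. Here is a plain-Python sketch of just the data movement (not the actual Beam implementation; names and the shard count are illustrative):

```python
import random
from collections import defaultdict

def shard_and_regroup(elements, num_shards=4):
    """Sketch of the shard-and-regroup idea: assign a random key to each
    element, group by key, then flatten. In a real Beam pipeline the
    grouping is a GroupByKey, i.e. a shuffle barrier that breaks fusion."""
    # Step 1: pair each element with a random shard key (KV pairs in Beam terms).
    keyed = [(random.randrange(num_shards), e) for e in elements]
    # Step 2: group by key (the shuffle boundary in the real pipeline).
    groups = defaultdict(list)
    for key, value in keyed:
        groups[key].append(value)
    # Step 3: flatten the groups back into individual elements.
    return [v for values in groups.values() for v in values]

rows = shard_and_regroup(range(100), num_shards=4)
print(sorted(rows) == list(range(100)))  # same elements, just redistributed
```

The elements come back in an arbitrary order but nothing is lost or duplicated, which is why the trick is safe to insert after a non-parallel read.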
On Fri, May 10, 2019 at 11:36 PM
Hello Maximilian,
Thanks for your help.
The other part of my question was about running a (Python) pipeline on the
Flink runner. I read the page
https://beam.apache.org/documentation/runners/flink/ but felt confused.
Will try one more time and then come back if I am still stuck with it.
Again,
There is no such flag to turn off fusion.
Writing 100s of GiBs of uncompressed data to reshuffle will take time when
it is limited to a small number of workers.
If you can split up your input into a lot of smaller compressed files, then
you shouldn't need to use the reshuffle but still cou
The best solution would be to find a compression format that is splittable
and add support for that to Apache Beam and use it. The issue with
compressed files is that you can't read from an arbitrary offset. This
stack overflow post[1] has some suggestions on seekable compression
libraries.
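The split-into-many-smaller-compressed-files suggestion above can be sketched with just the standard library: each shard stays gzip-compressed, but because there are many of them, a runner can hand one shard to each worker instead of funneling everything through a single reader. This is only an illustrative sketch; the shard size and naming scheme are arbitrary choices, not anything Beam prescribes:

```python
import gzip
import itertools

def split_gzip(src_path, out_prefix, lines_per_shard=100_000):
    """Split one large gzip text file into many smaller gzip shards.
    Each shard is independently readable, so a runner can assign one
    shard per worker rather than serializing on a single big file."""
    shards = []
    with gzip.open(src_path, "rt") as src:
        for i in itertools.count():
            # Take the next chunk of lines; stop when the source is exhausted.
            lines = list(itertools.islice(src, lines_per_shard))
            if not lines:
                break
            shard_path = f"{out_prefix}-{i:05d}.gz"
            with gzip.open(shard_path, "wt") as out:
                out.writelines(lines)
            shards.append(shard_path)
    return shards
```

A reshuffle may still help if the per-record processing after the read is heavily skewed, but with enough shards the read itself is no longer the bottleneck.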
A much
When you had X gzip files and were not using Reshuffle, did you see X
workers read and process the files?
On Fri, May 10, 2019 at 1:17 PM Allie Chen wrote:
> Yes, I do see the data after reshuffle are processed in parallel. But
> Reshuffle transform itself takes hours or even days to run, accord
+user@beam.apache.org
Reshuffle on Google Cloud Dataflow for a bounded pipeline waits till all
the data has been read before the next transforms can run. After the
reshuffle, the data should have been processed in parallel across the
workers. Did you see this?
Are you able to change the input of
Hi Averell,
What you want to do is possible today, but at this point it is an early
experimental feature. The reason for that is that Kafka is a
cross-language Java transform in a Python pipeline. We just recently
enabled cross-language pipelines.
1. First of all, until 2.13.0 is released you wi
Hi everyone,
I am trying to get started with Python on the Flink runner, to build a
pipeline that reads data from Kafka and writes to S3 in Parquet format.
I tried searching the Beam website, but could not find any example (even for
the basic word count). E.g., on this page
https://beam.apache