Re: Problem with gzip

2019-05-10 Thread Michael Luckey
Maybe the solution implemented in JdbcIO [1], [2] could be helpful in these cases. [1] https://issues.apache.org/jira/browse/BEAM-2803 [2] https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1088-L1118 On Fri, May 10, 2019 at 11:36 PM …
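The JdbcIO workaround referenced above breaks fusion by pairing each record with a random key, grouping by that key, and then discarding the keys so downstream transforms can run on different workers. A minimal plain-Python sketch of that pattern (not the actual Java code from the linked file) might look like:

```python
import random
from itertools import groupby

def break_fusion(records, num_shards=4, seed=42):
    """Sketch of the JdbcIO-style fusion break: pair each record with a
    random shard key, group by that key, then flatten the groups and
    discard the keys. In Beam, the grouping step forces a shuffle, so
    the stages before and after it are no longer fused together."""
    rng = random.Random(seed)
    keyed = [(rng.randrange(num_shards), r) for r in records]
    keyed.sort(key=lambda kv: kv[0])  # stand-in for GroupByKey
    shards = {k: [r for _, r in group]
              for k, group in groupby(keyed, key=lambda kv: kv[0])}
    # Flatten: same elements, now spread across up to num_shards groups.
    return [r for shard in shards.values() for r in shard]

redistributed = break_fusion(range(100))
```

The element order changes, but the multiset of elements is preserved, which is all the pattern promises.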

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-10 Thread Averell Huyen Levan
Hello Maximilian, Thanks for your help. The other part of my question was about running a (Python) pipeline on the Flink runner. I read the page https://beam.apache.org/documentation/runners/flink/ but felt confused. I will try one more time and come back if I am still stuck with it. Again, …

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
There is no such flag to turn off fusion. Writing 100s of GiBs of uncompressed data to a reshuffle will take time when it is limited to a small number of workers. If you can split up your input into a lot of smaller files that are compressed, then you shouldn't need to use the reshuffle but still cou…
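The suggestion above is to pre-shard the input so that each worker can read one small gzip file instead of one worker reading a single huge one. A stdlib-only sketch of producing such shards (file naming is illustrative):

```python
import gzip
import os
import tempfile

def split_to_gzip_shards(lines, shard_size, out_dir):
    """Write `lines` as several small gzip files. A runner can then
    assign one file per worker, sidestepping the need for a Reshuffle
    after reading a single large, unsplittable gzip file."""
    paths = []
    for i in range(0, len(lines), shard_size):
        path = os.path.join(out_dir, f"shard-{i // shard_size:05d}.txt.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in lines[i:i + shard_size])
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
shard_paths = split_to_gzip_shards([f"rec-{n}" for n in range(10)], 3, out_dir)

# Reading the shards back in order yields the original records.
recovered = []
for p in shard_paths:
    with gzip.open(p, "rt", encoding="utf-8") as f:
        recovered.extend(line.rstrip("\n") for line in f)
```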

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
The best solution would be to find a compression format that is splittable, add support for that to Apache Beam, and use it. The issue with compressed files is that you can't read from an arbitrary offset. This stack overflow post[1] has some suggestions on seekable compression libraries. A much…
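One trick that seekable/splittable formats (e.g. BGZF) build on is that a gzip stream may contain multiple concatenated members, each of which is decompressible on its own. A small stdlib demonstration of that property (not Beam-specific):

```python
import gzip

# Two independently compressed blocks, concatenated into one valid
# multi-member gzip stream.
block_a = gzip.compress(b"first block\n")
block_b = gzip.compress(b"second block\n")
stream = block_a + block_b

# Reading the whole stream yields both members in order...
whole = gzip.decompress(stream)

# ...while a reader that knows the byte offset of the second member can
# decode it without touching the preceding bytes -- the property a
# splittable format needs in order to start reads at arbitrary blocks.
tail = gzip.decompress(stream[len(block_a):])
```

A plain single-member gzip file lacks these known block boundaries, which is why an arbitrary offset inside it cannot be used as a read start point.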

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
When you had X gzip files and were not using Reshuffle, did you see X workers read and process the files? On Fri, May 10, 2019 at 1:17 PM Allie Chen wrote: > Yes, I do see the data after reshuffle are processed in parallel. But > Reshuffle transform itself takes hours or even days to run, accord…

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
+user@beam.apache.org Reshuffle on Google Cloud Dataflow for a bounded pipeline waits until all the data has been read before the next transforms can run. After the reshuffle, the data should have been processed in parallel across the workers. Did you see this? Are you able to change the input of …
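The reason a bounded reshuffle acts as a barrier is that it is implemented over a grouping step, and a grouping over bounded input cannot emit any group until every element has been seen, since a later element may still belong to an earlier group. A toy illustration:

```python
def group_by_key(pairs):
    """A grouping over bounded input is a barrier: no group is complete
    until the entire input has been consumed, because any later element
    may still belong to a group that has already been started."""
    groups = {}
    for key, value in pairs:   # must drain the whole input...
        groups.setdefault(key, []).append(value)
    return groups              # ...before any group can be emitted

events = [("a", 1), ("b", 2), ("a", 3)]
grouped = group_by_key(events)
```

Here the final ("a", 3) arrives after ("b", 2), so emitting group "a" early would have produced an incomplete result.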

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-10 Thread Maximilian Michels
Hi Averell, What you want to do is possible today, but at this point it is an early experimental feature. The reason is that Kafka is a cross-language Java transform in a Python pipeline, and we only recently enabled cross-language pipelines. 1. First of all, until 2.13.0 is released you wi…

Wordcount using Python with Flink runner and Kafka source

2019-05-10 Thread lvhuyen
Hi everyone, I am trying to get started with Python on the Flink runner, to build a pipeline that reads data from Kafka and writes to S3 in Parquet format. I tried to search the Beam website but could not find any example (even for the basic word count). E.g., on this page https://beam.apache…
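Independently of runner setup, the word-count logic that Beam's canonical examples express as a pipeline reduces to a map step (split lines into words) followed by a per-key count. A stdlib-only sketch of that core (the tokenizing regex is an illustrative choice, matching what the Beam Python wordcount example uses):

```python
import re
from collections import Counter

def count_words(lines):
    """The Map/Reduce core of a word count: tokenize each line into
    words, then count occurrences per word. In a Beam pipeline the
    tokenize step becomes a FlatMap and the count a per-key combine."""
    words = (w for line in lines for w in re.findall(r"[\w']+", line))
    return Counter(words)

counts = count_words(["to be or not to be", "to be is to do"])
```

Running a pipeline like this on Flink additionally requires the portable runner setup discussed elsewhere in this thread; the sketch only covers the transform logic itself.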