Re: [PR] [Flink] Speed up file write in batch mode by using larger bundle size [beam]

2024-04-09 Thread via GitHub
Abacn merged PR #30802: URL: https://github.com/apache/beam/pull/30802 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:

Re: [PR] [Flink] Speed up file write in batch mode by using larger bundle size [beam]

2024-04-08 Thread via GitHub
jto commented on PR #30802: URL: https://github.com/apache/beam/pull/30802#issuecomment-2042134252 Sure. I tested it on a job that consumes ~1B records (~150GB). With the DataSet API, runtime is 37min. Passing `--useDataStreamForBatch`, I killed it after 1h+ as it was clearly too

Re: [PR] [Flink] Speed up file write in batch mode by using larger bundle size [beam]

2024-04-05 Thread via GitHub
Abacn commented on PR #30802: URL: https://github.com/apache/beam/pull/30802#issuecomment-2040486462 Hi, thanks. Would you mind sharing some numbers on the performance difference? For example, a test case of 20,000,000 elements and the run time for different batch sizes.

Re: [PR] [Flink] Speed up file write in batch mode by using larger bundle size [beam]

2024-04-05 Thread via GitHub
jto commented on PR #30802: URL: https://github.com/apache/beam/pull/30802#issuecomment-2039755040 Pinging @Abacn since you reviewed my past PRs on the Flink runner :) Can you take a look?

Re: [PR] [Flink] Speed up file write in batch mode by using larger bundle size [beam]

2024-04-03 Thread via GitHub
github-actions[bot] commented on PR #30802: URL: https://github.com/apache/beam/pull/30802#issuecomment-2034336637 Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`: R: @chamikaramj added as fallback since no labels match

[PR] [Flink] Speed up file write in batch mode by using larger bundle size [beam]

2024-03-29 Thread via GitHub
jto opened a new pull request, #30802: URL: https://github.com/apache/beam/pull/30802 This PR removes the automated file sharding normally applied when the runner is passed `--useDataStreamForBatch`. Currently `FlinkStreamingPipelineTranslator.StreamingShardedWriteFactory`