Abacn merged PR #30802:
URL: https://github.com/apache/beam/pull/30802
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail:
jto commented on PR #30802:
URL: https://github.com/apache/beam/pull/30802#issuecomment-2042134252
Sure. I tested it on a job that consumes ~1B records (~150GB).
With the DataSet API, the runtime is 37min.
Passing `--useDataStreamForBatch`, I killed it after 1h+ as it was clearly
too
Abacn commented on PR #30802:
URL: https://github.com/apache/beam/pull/30802#issuecomment-2040486462
Hi, thanks. Would you mind sharing some numbers on the performance
difference? For example, a test case of 20,000,000 elements and the run time
for different batch sizes.
jto commented on PR #30802:
URL: https://github.com/apache/beam/pull/30802#issuecomment-2039755040
Pinging @Abacn since you reviewed my past PRs on the Flink runner :)
Can you take a look?
github-actions[bot] commented on PR #30802:
URL: https://github.com/apache/beam/pull/30802#issuecomment-2034336637
Assigning reviewers. If you would like to opt out of this review, comment
`assign to next reviewer`:
R: @chamikaramj added as fallback since no labels match
jto opened a new pull request, #30802:
URL: https://github.com/apache/beam/pull/30802
This PR removes the automated file sharding normally applied when the runner
is passed `--useDataStreamForBatch`.
Currently `FlinkStreamingPipelineTranslator.StreamingShardedWriteFactory`