Hi Fabian,

Could you provide a bit more information? In the past couple of weeks, some major changes have been made to improve performance. Are you running the pipeline executor with a Hop local engine configuration, or with Beam-Direct? If it is the latter, I fear that is not really supported currently, or at the very least untested.
That being said, Beam Direct is an engine type meant mainly for testing an implementation; it is not made for actual heavy lifting. I would test the implementation with a couple of files and do the actual heavy processing using Dataflow, Spark, or Flink. In one of our next releases, we are planning to add an "Advisor" that will warn about transforms we have not yet tested, or that we know will not always give the expected results.

Cheers,
Hans

On Mon, 10 Oct 2022 at 10:28, Fabian Peters <[email protected]> wrote:

> Hi all,
>
> I'm trying to process a few hundred Avro files on GCS. They are getting
> decoded and two simple filters are being applied. When running this on
> Beam-Direct, all heap space is getting filled within a minute or two. I
> threw 58 GB at it before giving up.
>
> To limit the number of files getting processed at once, I have moved the
> actual processing into a pipeline executor. Alas, when running on
> Beam-Direct, it looks like the transforms are only initialised but do not
> get executed. This concerns Write to Log, JavaScript, HTTP Client and
> BigQuery Output. Everything behaves as expected when I configure the
> pipeline executor to use the Local runner.
>
> So, two questions: Is the pipeline executor transform incompatible with
> Beam? And, are there other approaches for limiting memory use in such a
> case?
>
> cheers
>
> Fabian
