Hi Juan,

Thanks for your response. Dynamic allocation is already enabled. The issue isn't resource allocation but how Spark schedules jobs and stages.
On Tue, Nov 13, 2018 at 10:19 PM Juan Carlos Garcia <[email protected]> wrote:

> I suggest playing around with some Spark configurations, like the dynamic
> allocation parameters:
>
> https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
>
> On Wed, Nov 14, 2018, 02:28 Jiayue Zhang (Bravo) <[email protected]> wrote:
>
>> Hi,
>>
>> I'm writing Java Beam code to run with both Dataflow and Spark. The input
>> files are in TFRecord format and come from multiple directories. Java
>> TFRecordIO doesn't have a readAll from a list of files, so what I'm doing is:
>>
>> for (String dir : listOfDirs) {
>>   p.apply(TFRecordIO.read().from(dir))
>>       .apply(ParDo.of(new BatchElements()))
>>       .apply(ParDo.of(new Process()))
>>       .apply(Combine.globally(new CombineResult()))
>>       .apply(TextIO.write().to(dir));
>> }
>>
>> These directories are fairly independent, and I only need the result of each
>> directory. When running on Dataflow, processing of these directories happens
>> concurrently. But when running with Spark, I saw that the Spark jobs and
>> stages are sequential: it needs to finish all steps in one directory before
>> moving to the next one. What's the way to make multiple transforms run
>> concurrently with SparkRunner?
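
For concreteness, below is a minimal, compilable sketch of the branching structure described in the quoted question. The per-record byte count and the global sum are only placeholders standing in for the original BatchElements, Process, and CombineResult transforms, and the paths and class name are made up; the point is the loop shape, where each iteration adds an independent read -> process -> combine -> write branch to the same pipeline.

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MultiDirPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder directories; the real code iterates over listOfDirs.
    List<String> listOfDirs = Arrays.asList("gs://bucket/dirA", "gs://bucket/dirB");

    int i = 0;
    for (String dir : listOfDirs) {
      String tag = "dir" + i++;
      // Each iteration adds an independent branch to the same pipeline graph;
      // no branch consumes the output of another.
      p.apply("Read-" + tag, TFRecordIO.read().from(dir + "/*"))
          // Stand-in for BatchElements/Process: count bytes per record.
          .apply("Process-" + tag,
              MapElements.into(TypeDescriptors.longs())
                  .via((byte[] record) -> (long) record.length))
          // Stand-in for CombineResult: reduce each branch to a single global sum.
          .apply("Combine-" + tag, Combine.globally(Sum.ofLongs()))
          .apply("Format-" + tag,
              MapElements.into(TypeDescriptors.strings())
                  .via((Long total) -> Long.toString(total)))
          .apply("Write-" + tag, TextIO.write().to(dir + "/result"));
    }

    // On Dataflow these branches execute concurrently; with SparkRunner the
    // corresponding Spark jobs were observed to run one after another.
    p.run().waitUntilFinish();
  }
}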
