I suggest playing around with some Spark configuration settings, in particular the dynamic allocation parameters.
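For example, with spark-submit you could enable dynamic allocation and bound the executor count. This is only a rough sketch: the master, main class, jar name, and executor counts are placeholders for whatever your deployment actually uses, and on Spark 2.x the external shuffle service has to be enabled for dynamic allocation to work:

    spark-submit \
      --master yarn \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      --class com.example.YourBeamPipeline \
      your-beam-pipeline.jar \
      --runner=SparkRunner

The full list of options is documented here: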
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation

On Wed, Nov 14, 2018, 02:28 Jiayue Zhang (Bravo) <[email protected]> wrote:

> Hi,
>
> I'm writing Java Beam code to run with both Dataflow and Spark. The input
> files are in TFRecord format and come from multiple directories. Java
> TFRecordIO doesn't have a readAll for a list of files, so what I'm doing is:
>
> for (String dir : listOfDirs) {
>   p.apply(TFRecordIO.read().from(dir))
>    .apply(ParDo.of(new BatchElements()))
>    .apply(ParDo.of(new Process()))
>    .apply(Combine.globally(new CombineResult()))
>    .apply(TextIO.write().to(dir));
> }
>
> These directories are fairly independent and I only need the result of each
> directory. When running on Dataflow, processing of these directories happens
> concurrently. But when running with Spark, I see that the Spark jobs and
> stages are sequential: all steps for one directory have to finish before
> moving on to the next one. What's the way to make multiple transforms run
> concurrently with SparkRunner?
