I would suggest playing around with some Spark configuration options, such as
the dynamic allocation parameters:

https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
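
Something like the following, as a rough sketch (assuming you launch the Beam
pipeline with spark-submit; the class/jar names and executor bounds are just
placeholders to adjust for your cluster):

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      --class your.beam.MainClass your-pipeline.jar \
      --runner=SparkRunner

With dynamic allocation enabled, Spark can scale the number of executors up
and down based on the pending workload instead of sticking to a fixed size.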



On Wed, Nov 14, 2018, 02:28 Jiayue Zhang (Bravo) <[email protected]>
wrote:

> Hi,
>
> I'm writing Java Beam code to run with both Dataflow and Spark. The input
> files are in TFRecord format and come from multiple directories. Java
> TFRecordIO doesn't have a readAll from a list of files, so what I'm doing is:
>
> for (String dir: listOfDirs) {
>     p.apply(TFRecordIO.read().from(dir))
>      .apply(ParDo.of(new BatchElements()))
>      .apply(ParDo.of(new Process()))
>      .apply(Combine.globally(new CombineResult()))
>      .apply(TextIO.write().to(dir));
> }
>
> These directories are fairly independent and I only need the result for each
> directory. When running on Dataflow, processing of these directories happens
> concurrently. But when running with Spark, I saw that the Spark jobs and stages
> are sequential: it needs to finish all steps in one directory before moving on
> to the next one. What's the way to make multiple transforms run concurrently
> with SparkRunner?
>