Hi Juan,

Try playing with the spark.scheduler.mode property (setting it to "FAIR", for instance). By default, Spark runs its jobs/stages in FIFO mode, meaning that in your case, if Spark can allocate all workers to writing to one of the directories, it will process the directories one after the other.
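Purely as an illustration (not part of the original reply), here is roughly how that property could be handed to a Beam pipeline on the SparkRunner, assuming the runner's provided-SparkContext route via SparkContextOptions; the class name below is hypothetical, the option names may differ between runner versions, and passing --conf spark.scheduler.mode=FAIR to spark-submit should achieve the same thing:

import org.apache.beam.runners.spark.SparkContextOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class FairSchedulerExample {
  public static void main(String[] args) {
    // Configure the Spark context for FAIR scheduling instead of the default FIFO.
    // Master/deployment settings are omitted here; they normally come from spark-submit.
    SparkConf conf = new SparkConf()
        .setAppName("beam-per-directory-pipeline")   // hypothetical app name
        .set("spark.scheduler.mode", "FAIR");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Hand the pre-configured context to the Beam SparkRunner.
    SparkContextOptions options = PipelineOptionsFactory.fromArgs(args)
        .as(SparkContextOptions.class);
    options.setRunner(SparkRunner.class);
    options.setUsesProvidedSparkContext(true);
    options.setProvidedSparkContext(jsc);

    Pipeline p = Pipeline.create(options);
    // ... build the per-directory branches on p, then run:
    p.run().waitUntilFinish();
  }
}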
Brgds

Juan Carlos Garcia wrote on 14/11/2018 at 07:18:

I suggest playing around with some Spark configurations, like the dynamic allocation parameters: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation

On Wed., 14 Nov 2018 at 02:28, Jiayue Zhang (Bravo) <[email protected]> wrote:

Hi,

I'm writing Java Beam code to run with both Dataflow and Spark. The input files are in TFRecord format and come from multiple directories. The Java TFRecordIO doesn't have a readAll from a list of files, so what I'm doing is:

for (String dir : listOfDirs) {
  p.apply(TFRecordIO.read().from(dir))
   .apply(ParDo.of(new BatchElements()))
   .apply(ParDo.of(new Process()))
   .apply(Combine.globally(new CombineResult()))
   .apply(TextIO.write().to(dir));
}

These directories are fairly independent, and I only need the result for each directory. When running on Dataflow, the processing of these directories happens concurrently, but when running with Spark I see that the Spark jobs and stages are sequential: it has to finish all steps in one directory before moving on to the next one. What's the way to make multiple transforms run concurrently with SparkRunner?
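(Again, an illustration added for reference rather than text from the thread.) The dynamic-allocation settings Juan Carlos links to are ordinary Spark properties from that configuration page, so they can be set on the same SparkConf as in the sketch above, or via spark-submit --conf. The values are placeholders, and whether an external shuffle service is required depends on the cluster manager and Spark version:

import org.apache.spark.SparkConf;

public class DynamicAllocationConfExample {
  public static void main(String[] args) {
    // Example values only; tune executor bounds for the cluster.
    SparkConf conf = new SparkConf()
        .setAppName("beam-per-directory-pipeline")          // hypothetical app name
        .set("spark.scheduler.mode", "FAIR")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20")
        .set("spark.shuffle.service.enabled", "true");       // typically needed for dynamic allocation
    System.out.println(conf.toDebugString());                // print the effective settings
  }
}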
