Hi Thomas,

Thanks for your response. Setting the scheduler mode doesn't help here, because the SparkRunner submits jobs one at a time, so Spark never sees more than one active job to schedule. I wonder whether there's a way, such as spinning up threads or providing a custom SparkContext, to make multiple Spark jobs active at the same time. A rough sketch of what I mean is below.
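To make the idea concrete, here's a rough, untested sketch: one pipeline per directory, each built and run from its own driver thread against a shared, provided SparkContext with FAIR scheduling turned on, so Spark actually has several jobs in flight. BatchElements, Process, and CombineResult are the transforms from my original snippet (assumed to be importable here); the directory paths and class name are placeholders, and I'm assuming SparkContextOptions tolerates being used this way from multiple threads:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.beam.runners.spark.SparkContextOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ConcurrentDirectories {
  public static void main(String[] args) throws Exception {
    // Placeholder paths; substitute the real input directories.
    List<String> listOfDirs =
        Arrays.asList("hdfs:///data/dirA", "hdfs:///data/dirB");

    // One shared context for every pipeline; FAIR mode lets jobs submitted
    // from different threads share executors instead of queueing FIFO.
    SparkConf conf = new SparkConf()
        .setAppName("concurrent-dirs")
        .set("spark.scheduler.mode", "FAIR");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    ExecutorService pool = Executors.newFixedThreadPool(listOfDirs.size());
    List<Future<PipelineResult.State>> futures = new ArrayList<>();

    for (String dir : listOfDirs) {
      futures.add(pool.submit(() -> {
        // Point the SparkRunner at the context created above.
        SparkContextOptions options =
            PipelineOptionsFactory.as(SparkContextOptions.class);
        options.setRunner(SparkRunner.class);
        options.setUsesProvidedSparkContext(true);
        options.setProvidedSparkContext(jsc);

        // Same per-directory pipeline as in my original snippet, but built
        // and run on its own thread so several jobs can be active at once.
        Pipeline p = Pipeline.create(options);
        p.apply(TFRecordIO.read().from(dir))
            .apply(ParDo.of(new BatchElements()))
            .apply(ParDo.of(new Process()))
            .apply(Combine.globally(new CombineResult()))
            .apply(TextIO.write().to(dir));
        return p.run().waitUntilFinish(); // blocks this thread only
      }));
    }

    for (Future<PipelineResult.State> f : futures) {
      f.get(); // surface any per-directory failure
    }
    pool.shutdown();
    jsc.stop();
  }
}

If sharing one context across threads turns out to be fragile, the fallback would be fully separate driver processes, one per directory, at the cost of duplicated executors.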
On Wed, Nov 14, 2018 at 4:18 AM Thomas Fion <[email protected]> wrote:

> Hi Juan,
>
> Try playing with the spark.scheduler.mode property (setting it to "FAIR",
> for instance). By default, Spark runs its jobs/stages in FIFO mode, meaning
> that in your case, if it can allocate all workers to writing one of the
> directories, it will process the directories one after the other.
>
> Brgds
>
> Juan Carlos Garcia wrote on 14/11/2018 at 07:18:
>
> I suggest playing around with some Spark configuration, like the dynamic
> allocation parameters:
> https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
>
> On Wed, Nov 14, 2018 at 02:28, Jiayue Zhang (Bravo) <[email protected]> wrote:
>
>> Hi,
>>
>> I'm writing Java Beam code to run with both Dataflow and Spark. The input
>> files are in tfrecord format and come from multiple directories. Java
>> TFRecordIO doesn't have a readAll from a list of files, so what I'm doing
>> is:
>>
>> for (String dir : listOfDirs) {
>>   p.apply(TFRecordIO.read().from(dir))
>>    .apply(ParDo.of(new BatchElements()))
>>    .apply(ParDo.of(new Process()))
>>    .apply(Combine.globally(new CombineResult()))
>>    .apply(TextIO.write().to(dir));
>> }
>>
>> These directories are fairly independent, and I only need the result of
>> each one. When running on Dataflow, the directories are processed
>> concurrently, but when running with Spark, the jobs and stages are
>> sequential: it finishes all steps for one directory before moving on to
>> the next. What's the way to make multiple transforms run concurrently
>> with SparkRunner?
