Hi Juan,

Thanks for your response. Dynamic allocation is already enabled. The issue isn't resource allocation but how Spark schedules jobs and stages.
On Tue, Nov 13, 2018 at 10:19 PM Juan Carlos Garcia <[email protected]> wrote:

> I suggest playing around with some Spark configurations, like the dynamic
> allocation parameters:
>
> https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
>
> On Wed, Nov 14, 2018, 02:28 Jiayue Zhang (Bravo) <[email protected]> wrote:
>
>> Hi,
>>
>> I'm writing Java Beam code to run with both Dataflow and Spark. The input
>> files are in TFRecord format and come from multiple directories. Java
>> TFRecordIO doesn't have a readAll from a list of files, so what I'm doing is:
>>
>> for (String dir : listOfDirs) {
>>   p.apply(TFRecordIO.read().from(dir))
>>       .apply(ParDo.of(new BatchElements()))
>>       .apply(ParDo.of(new Process()))
>>       .apply(Combine.globally(new CombineResult()))
>>       .apply(TextIO.write().to(dir));
>> }
>>
>> These directories are fairly independent, and I only need the result of each
>> directory. When running on Dataflow, processing of these directories happens
>> concurrently. But when running with Spark, I saw that the Spark jobs and
>> stages are sequential: it needs to finish all steps in one directory before
>> moving to the next one. What's the way to make multiple transforms run
>> concurrently with SparkRunner?
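
For concreteness, below is a minimal, compilable sketch of the branching structure described in the quoted question. The per-record byte count and the global sum are only placeholders standing in for the original BatchElements, Process, and CombineResult transforms, and the paths and class name are made up; the point is the loop shape, where each iteration adds an independent read -> process -> combine -> write branch to the same pipeline.

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MultiDirPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder directories; the real code iterates over listOfDirs.
    List<String> listOfDirs = Arrays.asList("gs://bucket/dirA", "gs://bucket/dirB");

    int i = 0;
    for (String dir : listOfDirs) {
      String tag = "dir" + i++;
      // Each iteration adds an independent branch to the same pipeline graph;
      // no branch consumes the output of another.
      p.apply("Read-" + tag, TFRecordIO.read().from(dir + "/*"))
          // Stand-in for BatchElements/Process: count bytes per record.
          .apply("Process-" + tag,
              MapElements.into(TypeDescriptors.longs())
                  .via((byte[] record) -> (long) record.length))
          // Stand-in for CombineResult: reduce each branch to a single global sum.
          .apply("Combine-" + tag, Combine.globally(Sum.ofLongs()))
          .apply("Format-" + tag,
              MapElements.into(TypeDescriptors.strings())
                  .via((Long total) -> Long.toString(total)))
          .apply("Write-" + tag, TextIO.write().to(dir + "/result"));
    }

    // On Dataflow these branches execute concurrently; with SparkRunner the
    // corresponding Spark jobs were observed to run one after another.
    p.run().waitUntilFinish();
  }
}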
