Hi Ismaël,
as discussed, the pipeline code should not use a runner-specific
pipeline options object, in order to stay runner agnostic.
Something like:
SparkPipelineOptions options =
    PipelineOptionsFactory.as(SparkPipelineOptions.class);
should not be used.
It's better to use something like:
PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, I think we could improve the factory a bit.
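As a minimal sketch of the runner-agnostic pattern described above (the class name is an assumption for illustration; the PipelineOptionsFactory calls are the standard Beam API):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerAgnosticMain {
  public static void main(String[] args) {
    // Parse generic options from the command line; the runner is chosen
    // via --runner=FlinkRunner or --runner=SparkRunner at launch time,
    // not hard-coded in the pipeline.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline pipeline = Pipeline.create(options);
    // ... build the same transforms regardless of the runner ...
    pipeline.run();
  }
}
```

This way the same compiled pipeline can target either engine, with only the classpath and the --runner flag differing between runs.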
Regards
JB
On 05/27/2016 10:22 AM, Ismaël Mejía wrote:
I spent last week running tests on multiple runners, and in theory
you should not have to change much; however, you must take care not to
mix runner-specific dependencies when you create your project (e.g.
you don't want runner-specific classes like FlinkPipelineOptions or
SparkPipelineOptions in your code).
Good benchmarking practice is a trickier subject; e.g. you must make
sure that both runners use at least similar parallelism levels. Of
course there are many dimensions to benchmarking, and in this space in
particular the real question to start with is: what do you want to
benchmark (throughput, resource utilisation, etc.)? Is your pipeline
batch only, or streaming too? Then try to create a reproducible
scenario in which you expect similar behaviour across runners.
But one thing is clear: you should expect some differences, since each
runner's internal model is different, as is its maturity level (at
least at this point).
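The runtime switch the thread describes might look like this (the jar name, main class, and Maven profile names are hypothetical; the --runner flag is the standard Beam mechanism):

```shell
# Same pipeline code, two runners: only the build profile and the
# --runner flag change between runs.

# Build with the Flink runner on the classpath and run on Flink:
mvn package -Pflink-runner
java -jar target/lr-app.jar --runner=FlinkRunner

# Rebuild with the Spark runner on the classpath and run on Spark:
mvn package -Pspark-runner
java -jar target/lr-app.jar --runner=SparkRunner
```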
Ismaël
On Fri, May 27, 2016 at 1:19 AM, amir bahmanyari <[email protected]> wrote:
Hi Colleagues,
I have implemented the Java version of the MIT's Linear Road
algorithm as a Beam app.
I sanity tested it in a Flink Cluster (FlinkRunner). Works fine.
Receives tuples from Kafka, executes the LR algorithm, and produces
the correct results.
I would like to repeat the same in a Spark cluster.
I am assuming that, other than changing the Runner type (Flink vs.
Spark) at runtime, I should not need to make any code changes.
Is that the right assumption, given what Beam promises about
unifying the underlying streaming engines?
The real question is: What should I take into consideration if I
want to Benchmark Flink vs Spark by executing my same Beam LR app in
both engines?
How would you approach the benchmarking process? What would you
look for when comparing the two, etc.?
Thanks so much for your valuable time.
Amir-
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com