Hi Ismaël,
as discussed, the pipeline code should not use a runner-specific
pipeline options object, in order to stay runner agnostic.
Something like:
SparkPipelineOptions options =
    PipelineOptionsFactory.as(SparkPipelineOptions.class);
should not be used.
It's better to use something like:
PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, I think we could improve the factory a bit.
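As a minimal sketch of the runner-agnostic pattern described above (the class name is an assumption for illustration; the PipelineOptionsFactory calls are the standard Beam API):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerAgnosticMain {
  public static void main(String[] args) {
    // Parse generic options from the command line; the runner is chosen
    // via --runner=FlinkRunner or --runner=SparkRunner at launch time,
    // not hard-coded in the pipeline.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline pipeline = Pipeline.create(options);
    // ... build the same transforms regardless of the runner ...
    pipeline.run();
  }
}
```

This way the same compiled pipeline can target either engine, with only the classpath and the --runner flag differing between runs.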
Regards
JB
On 05/27/2016 10:22 AM, Ismaël Mejía wrote:
I spent last week running tests on multiple runners, and in theory
you should not have to change much; however, you must take care not to
mix runner-specific dependencies when you create your project (e.g.
you don't want runner-specific classes like FlinkPipelineOptions or
SparkPipelineOptions in your code).
Good benchmarking practice is a trickier subject; e.g. you must make
sure that both runners use at least similar parallelism levels. Of
course there are many dimensions to benchmarking, and in this space in
particular the real question to start with is: what do you want to
benchmark (throughput, resource utilisation, etc.)? Is your pipeline
batch only, or streaming too? Then try to create a reproducible
scenario in which you expect similar behaviour across runners.
But one thing is clear: you should expect some differences, since each
runner's internal model is different, as is its maturity level (at
least at this point).
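The runtime switch the thread describes might look like this (the jar name, main class, and Maven profile names are hypothetical; the --runner flag is the standard Beam mechanism):

```shell
# Same pipeline code, two runners: only the build profile and the
# --runner flag change between runs.

# Build with the Flink runner on the classpath and run on Flink:
mvn package -Pflink-runner
java -jar target/lr-app.jar --runner=FlinkRunner

# Rebuild with the Spark runner on the classpath and run on Spark:
mvn package -Pspark-runner
java -jar target/lr-app.jar --runner=SparkRunner
```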
Ismaël
On Fri, May 27, 2016 at 1:19 AM, amir bahmanyari <[email protected]> wrote:
Hi Colleagues,
I have implemented the Java version of the MIT's Linear Road
algorithm as a Beam app.
I sanity tested it in a Flink Cluster (FlinkRunner). Works fine.
Receives tuples from Kafka, executes the LR algorithm, and produces
the correct results.
I would like to repeat the same in a Spark cluster.
I am assuming that, other than changing the Runner type (Flink vs.
Spark) at runtime, I should not need to make any code changes.
Is that the right assumption, given what Beam promises about
unifying the underlying streaming engines?
The real question is: What should I take into consideration if I
want to Benchmark Flink vs Spark by executing my same Beam LR app in
both engines?
How would you approach the benchmarking process? What would you
look for when comparing the two, etc.?
Thanks so much for your valuable time.
Amir-
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com