Hello Beamers,

You'll be pleased to know that I've made some progress in my Beam/Flink SDK 
comparison exercise (maybe now she'll go away...). However, it would be great 
if those of you who are more familiar with Beam and Flink could cast your eyes 
over my new code snippets and let me know if you can spot any massive 
discrepancies.


Following the advice from some of you at the Beam Summit last week (special 
thanks to Reuven for taking the time to look into this issue with me), I have 
made the following changes:

  1.  Used a KeyedDeserializationSchema implementation in my Flink example, 
instead of the JSONDeserializationSchema that eagerly deserialised the whole 
object before it was needed (a rough sketch of what I mean follows this list).
  2.  Simplified my data model so the records read from Kafka are non-keyed and 
deserialise from byte[] to primitives.
  3.  Removed all JSON serialisation and deserialisation from this example.
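
For anyone who would rather not open the gists, here is roughly the shape of 
change (1). It is a simplified sketch rather than my actual code: the class 
name EnergyReadingSchema is made up, and it assumes the Kafka value is an 
8-byte double. The point is that only the value bytes are touched, and nothing 
is parsed eagerly the way JSONDeserializationSchema used to.

import java.nio.ByteBuffer;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

public class EnergyReadingSchema implements KeyedDeserializationSchema<Double> {

    @Override
    public Double deserialize(byte[] messageKey, byte[] message,
                              String topic, int partition, long offset) {
        // Only the value is converted, straight from byte[] to a primitive;
        // the key and the Kafka metadata are ignored.
        return ByteBuffer.wrap(message).getDouble();
    }

    @Override
    public boolean isEndOfStream(Double nextElement) {
        return false; // the topics are unbounded
    }

    @Override
    public TypeInformation<Double> getProducedType() {
        return TypeInformation.of(Double.class);
    }
}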

Both snippets now run much faster, and the performance is equivalent. This is 
much closer to what I was originally expecting, but the question remains: am I 
as close to a like-for-like comparison as I can get? Here are the amended code 
snippets:

https://gist.github.com/tvergilio/fbb2e855e3d32de223d171d91fd1ec1e

https://gist.github.com/tvergilio/453638bd7cdd28a808ee103775b1fae5


And here is a recap of the experiment (if you care to know the details):

My aim is to write two equivalent pipelines, one using Beam and the other using 
the Flink SDK. I will then run them on the same Flink cluster to measure the 
presumed cost of using an abstraction such as Beam for code portability (as 
opposed to using the Flink SDK directly). I am using containerised Flink 
running on cloud-based nodes.


Procedure:

Two measurements are emitted to two different Kafka topics simultaneously: the 
total energy consumption of a server room, and the energy consumption of the IT 
equipment in that room. This data is emitted for 5 minutes at a time, at 
different rates (one reading every minute, every second, every millisecond, 
etc.), and 
the performance of the task managers in terms of container CPU, memory and 
network usage is measured.
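
For context, the generator is along these lines. This is a hypothetical sketch 
rather than my actual emitter: the topic names "total-energy" and "it-energy", 
the broker address and the placeholder readings are all made up, and the values 
are written as raw 8-byte doubles to match the schema sketched above.

import java.nio.ByteBuffer;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EnergyEmitter {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        long intervalMillis = Long.parseLong(args[0]);   // 60000, 1000, 1, ...
        long endTime = System.currentTimeMillis() + 5 * 60 * 1000;

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // One reading per interval to each topic, for five minutes.
            while (System.currentTimeMillis() < endTime) {
                producer.send(new ProducerRecord<>("total-energy", encode(readTotal())));
                producer.send(new ProducerRecord<>("it-energy", encode(readIt())));
                Thread.sleep(intervalMillis);
            }
        }
    }

    private static byte[] encode(double value) {
        return ByteBuffer.allocate(Double.BYTES).putDouble(value).array();
    }

    // Placeholder readings; the real measurements come from the server room.
    private static double readTotal() { return 120.0; }
    private static double readIt()    { return 100.0; }
}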


The processing pipeline calculates the Power Usage Effectiveness (PUE) of the 
room. This is simply the total energy consumed divided by the IT energy 
consumed. The PUE is calculated for a given window of time, so there must be 
some notion of per-window state (or else all the records of a window must have 
arrived before the entire calculation is executed). I have used exactly the 
same PUE calculation in both cases, which can be found here (a rough sketch of 
its shape follows the two links):


https://gist.github.com/tvergilio/1c78ea337e6795de01f06cafdfa4cf84


and here:


https://gist.github.com/tvergilio/2ed7d4541bc0de14325f82f8aa538d43
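
The gists contain the real code; just to make its shape concrete, the current 
"whole window at once" version looks roughly like this in Flink terms. It is a 
sketch only: the stream names, the one-minute window and the "total"/"it" tags 
are illustrative, imports are omitted, and timestamps/watermarks are assumed to 
have been assigned already.

// totalEnergy and itEnergy are assumed to be DataStream<Double>, already read
// from the two Kafka topics.
DataStream<Tuple2<String, Double>> tagged = totalEnergy
        .map(v -> Tuple2.of("total", v))
        .returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
        .union(itEnergy
                .map(v -> Tuple2.of("it", v))
                .returns(Types.TUPLE(Types.STRING, Types.DOUBLE)));

DataStream<Double> pue = tagged
        .windowAll(TumblingEventTimeWindows.of(Time.minutes(1)))
        .process(new ProcessAllWindowFunction<Tuple2<String, Double>, Double, TimeWindow>() {
            @Override
            public void process(Context ctx, Iterable<Tuple2<String, Double>> readings,
                                Collector<Double> out) {
                // Every record of the window is buffered until the watermark
                // fires, then both sums are computed in one pass.
                double total = 0.0;
                double it = 0.0;
                for (Tuple2<String, Double> r : readings) {
                    if ("total".equals(r.f0)) { total += r.f1; } else { it += r.f1; }
                }
                out.collect(total / it); // PUE = total energy / IT energy
            }
        });

This is the per-window state (or rather per-window buffering) I mentioned: all 
the records of the window are held until it closes.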


Now, theoretically, the PUE calculation could be improved by aggregating the 
real energy readings as they arrive, doing the same for the IT readings, 
emitting both aggregates when the watermark passes the end of the window, and 
then having a separate step calculate the PUE by dividing one aggregate by the 
other.
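
To make that concrete, the refinement I have in mind would look roughly like 
this in Beam terms. Again this is only a sketch with made-up names and imports 
omitted; it assumes both inputs are PCollection<KV<String, Double>> keyed by 
the same key (e.g. the room id), windowed identically, and that both topics 
produce data in every window.

final TupleTag<Double> totalTag = new TupleTag<Double>() {};
final TupleTag<Double> itTag = new TupleTag<Double>() {};

// Each topic is summed per window; Combine lets the runner accumulate partial
// sums as elements arrive instead of buffering every record.
PCollection<KV<String, Double>> totalSums = totalEnergy
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Sum.doublesPerKey());

PCollection<KV<String, Double>> itSums = itEnergy
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Sum.doublesPerKey());

// The two per-window sums are joined once the watermark passes the end of the
// window, and a separate step does the division.
PCollection<Double> pue = KeyedPCollectionTuple
        .of(totalTag, totalSums)
        .and(itTag, itSums)
        .apply(CoGroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, Double>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                double total = c.element().getValue().getOnly(totalTag);
                double it = c.element().getValue().getOnly(itTag);
                c.output(total / it); // PUE for the window
            }
        }));
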
My last question is: do you think this refinement is necessary, or is the 
"whole window at once" approach good enough? Would you expect the difference to 
be significant?


Thank you very much for taking the time to read this VERY long e-mail. Any 
suggestions/opinions/feedback are much appreciated.


All the best,


Thalita

