Hey,

I have successfully integrated Flink into our very small test cluster (3 
machines with 8 cores, 8 GB of memory and 2x1 TB disks). I started the 
session with YARN as the resource manager, and the data is read from HDFS:
/yarn-session.sh -n 21 -s 1 -jm 1024 -tm 1024

My code is very simple: a flatMap over the CSV data extracts the signal name 
and value, then I group by signal name and perform a group reduce to calculate 
the max, min and average of the collected values.
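
For reference, the per-signal aggregation is roughly equivalent to the 
plain-Python sketch below. It is not the actual Flink job: the function names, 
and the assumption that each CSV line has the form `signal_name,value`, are 
only illustrative.

```python
from collections import defaultdict


def parse_line(line):
    """flatMap step: emit (signal_name, value) pairs; skip malformed lines."""
    parts = line.strip().split(",")
    if len(parts) < 2:
        return []
    try:
        return [(parts[0], float(parts[1]))]
    except ValueError:
        # Non-numeric value: emit nothing for this line.
        return []


def aggregate(lines):
    """groupBy + group reduce: max, min and average per signal name."""
    groups = defaultdict(list)
    for line in lines:
        for name, value in parse_line(line):
            groups[name].append(value)
    return {
        name: {"max": max(vals), "min": min(vals), "avg": sum(vals) / len(vals)}
        for name, vals in groups.items()
    }
```

Calling `aggregate(open(path))` on a file (hypothetical `path`) returns one 
dictionary entry per signal name.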

On the 3 nodes I observe an average processing rate of around 11 MB/second. 
I compared this with an MR execution (without any kind of tuning) and I am 
quite surprised, since Hadoop reaches 85 MB/second when executing the same 
query on the same data. I have read a few reports claiming that Flink performs 
better than MR and other tools, so I am wondering what is wrong. Any clue?

The processing rate is calculated according to the following formula:
Overall processing rate = sum of the data read per job / sum of the time each 
job was running (including staging periods)
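
As a small sketch of that calculation (the function name and the input layout 
are only illustrative, and any numbers passed in would be placeholders, not my 
measurements):

```python
def overall_rate(jobs):
    """Overall processing rate in bytes per second.

    jobs: iterable of (bytes_read, seconds_running) pairs, one per job,
    where seconds_running includes staging periods.
    """
    jobs = list(jobs)  # allow generators; we iterate twice below
    total_bytes = sum(b for b, _ in jobs)
    total_seconds = sum(s for _, s in jobs)
    if total_seconds <= 0:
        raise ValueError("total running time must be positive")
    return total_bytes / total_seconds
```

Note that this is the ratio of the sums, not the mean of the per-job rates, so 
long-running jobs weigh more heavily in the result.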

Best regards,
Serhiy.
