Hey,

I have successfully integrated Flink into our very small test cluster (3 machines with 8 cores, 8 GB of memory and 2x1 TB disks). I start the session with YARN as the resource manager, and the data is read from HDFS:

/yarn-session.sh -n 21 -s 1 -jm 1024 -tm 1024
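For reference, here is a rough sketch of what that session command asks for in total. It assumes the hardware figures are per machine, that -n is the number of TaskManager containers, -s the slots per TaskManager, and -jm/-tm the JobManager and TaskManager memory in MB, and it ignores any overhead added by YARN or the operating system. The function name session_request is made up for illustration.

```python
def session_request(containers, slots_per_tm, jm_mb, tm_mb):
    """Totals requested by a yarn-session.sh call (hypothetical helper)."""
    return {
        # Each TaskManager container offers slots_per_tm parallel slots.
        "total_slots": containers * slots_per_tm,
        # One JobManager container plus all TaskManager containers.
        "total_memory_mb": containers * tm_mb + jm_mb,
    }


# Values taken from: -n 21 -s 1 -jm 1024 -tm 1024
request = session_request(containers=21, slots_per_tm=1, jm_mb=1024, tm_mb=1024)

# Assumption: 8 cores and 8 GB are per machine, 3 machines in total.
cluster_cores = 3 * 8
cluster_memory_mb = 3 * 8 * 1024

print(request, cluster_cores, cluster_memory_mb)
```

Under those assumptions the session requests 21 slots and 22528 MB on a cluster with 24 cores and 24576 MB of physical memory.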
My code is very simple: a flatMap over the CSV data extracts the signal name and value, then I group by signal name and run a group reduce to calculate the max, min and average of the collected values.

On the 3 nodes I observe an average processing rate of around 11 MB/second. I compared this with an MR execution (without any kind of tuning) and I am quite surprised: Hadoop reaches 85 MB/second when executing the same query on the same data. I have read a few reports claiming that Flink performs better than MR and other tools, so I am wondering what is wrong. Any clue?

The processing rate is calculated according to the following formula:

Overall processing rate = (sum of the total amount of data read per job) / (sum of the total time each job was running, including staging periods)

Best regards,
Serhiy.
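To make the question concrete, here is a minimal plain-Python sketch of the logic described above. It is not the actual Flink job. It assumes a CSV layout of "signal_name,value" per line, and the function names extract, summarize and overall_rate are made up for illustration.

```python
from collections import defaultdict


def extract(line):
    """flatMap step: emit zero or one (signal, value) pair per CSV line.

    Assumption: the first field is the signal name, the second its value.
    """
    fields = line.strip().split(",")
    if len(fields) < 2:
        return []
    try:
        return [(fields[0], float(fields[1]))]
    except ValueError:
        # Skip lines whose value field is not numeric.
        return []


def summarize(lines):
    """Group by signal name, then reduce each group to (max, min, average)."""
    groups = defaultdict(list)
    for line in lines:
        for name, value in extract(line):
            groups[name].append(value)
    return {
        name: (max(values), min(values), sum(values) / len(values))
        for name, values in groups.items()
    }


def overall_rate(jobs):
    """Overall processing rate as defined in the formula above.

    jobs: (bytes_read, seconds_running) pairs, where seconds_running
    includes staging periods.
    """
    jobs = list(jobs)
    total_bytes = sum(b for b, _ in jobs)
    total_seconds = sum(s for _, s in jobs)
    return total_bytes / total_seconds
```

Note that the rate is the ratio of the two sums, not the mean of the per-job rates, so long-running jobs weigh more heavily in the result.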