Hi,

We recently migrated our streaming jobs to the direct Kafka stream (createDirectStream). The initial migration went quite smoothly, but we are now seeing a zig-zag performance pattern we cannot explain: with a stable streaming rate, tasks alternate between roughly 1 second and 7 seconds to complete.
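For reference, the jobs are structured roughly like this (a simplified sketch; the broker list, topic, batch interval and the process/reportMetric helpers are placeholders, not our actual code):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-job")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker-1:9092,broker-2:9092")
    val topics = Set("events")

    // Direct (receiver-less) Kafka stream, as in our migrated jobs.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // 'spark computation time': wall clock around the whole foreachRDD body,
      // measured on the driver.
      val jobStart = System.currentTimeMillis()

      rdd.foreachPartition { records =>
        // 'local computation time': wall clock of the per-partition work,
        // measured on the executor.
        val localStart = System.currentTimeMillis()
        records.foreach { case (_, value) =>
          process(value) // placeholder for the actual per-record work
        }
        reportMetric("local computation time", System.currentTimeMillis() - localStart)
      }

      reportMetric("spark computation time", System.currentTimeMillis() - jobStart)
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Placeholders standing in for the real processing logic and metric collection.
  def process(value: String): Unit = ()
  def reportMetric(name: String, millis: Long): Unit = println(s"$name: $millis ms")
}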
Here are comparable metrics (aggregated per executor) for two successive tasks:

*Slow*:
Executor ID                              | Address                    | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks
20151006-044141-2408867082-5050-21047-S0 | dnode-3.hdfs.private:36863 | 22 s      | 3           | 0            | 3
20151006-044141-2408867082-5050-21047-S1 | dnode-0.hdfs.private:43812 | 40 s      | 11          | 0            | 11
20151006-044141-2408867082-5050-21047-S4 | dnode-5.hdfs.private:59945 | 49 s      | 10          | 0            | 10

*Fast*:
Executor ID                              | Address                    | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks
20151006-044141-2408867082-5050-21047-S0 | dnode-3.hdfs.private:36863 | 0.6 s     | 4           | 0            | 4
20151006-044141-2408867082-5050-21047-S1 | dnode-0.hdfs.private:43812 | 1 s       | 9           | 0            | 9
20151006-044141-2408867082-5050-21047-S4 | dnode-5.hdfs.private:59945 | 1 s       | 11          | 0            | 11

We have some custom metrics that measure the wall-clock time of certain blocks of the job, e.g. the time spent in the local computations (the RDD.foreachPartition closure) vs. the total time. The difference between the slow and the fast task lies in the 'spark computation time', which is the wall-clock time of the task scheduling and execution (the DStream.foreachRDD closure). For example (times in ms):

Slow task: local computation time: 347.60968499999996, *spark computation time: 6930*, metric collection: 70, total process: 7000, total_records: 4297

Fast task: local computation time: 281.539042, *spark computation time: 263*, metric collection: 138, total process: 401, total_records: 5002

We are currently running Spark 1.4.1. The load and the work to be done are stable; this is on a dev environment where we have that under control.

Any ideas what could cause this behavior?

Thanks in advance,
Gerard.