Hi Spark users,

In our Spark Streaming app, which uses the Kafka integration on Mesos, we initialized 3 receivers to consume 3 Kafka partitions, but we observe an imbalance in record receiving rates: with spark.streaming.receiver.maxRate set to 120, one receiver sometimes runs very close to the limit while the other two receive only roughly 50 records per second.
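For reference, the receiving side of the app is set up roughly like the sketch below (simplified: the ZooKeeper quorum, consumer group, and topic name are placeholders, and the real processing is replaced by a dummy count):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("kafka-streaming-app")                // master is supplied by spark-submit on Mesos
  .set("spark.streaming.receiver.maxRate", "120")   // per-receiver limit, records/sec
  .set("spark.streaming.blockInterval", "160")      // ms, roughly 10s * 1000 / 64
  .set("spark.storage.memoryFraction", "0.1")

val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batch interval

// One receiver per Kafka partition (3 partitions -> 3 receivers).
val streams = (1 to 3).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "consumer-group", Map("topic" -> 1),
    StorageLevel.MEMORY_AND_DISK_SER)
}

// Union the 3 receiver DStreams into a single DStream for downstream processing.
val unioned = ssc.union(streams)

unioned.map(_._2).count().print()                   // dummy processing

ssc.start()
ssc.awaitTermination()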
This may have been caused by an earlier receiver failure, in which one receiver's receiving rate dropped to 0. We restarted the Spark Streaming app, and the imbalance began. We suspect that the partition consumed by the failed receiver got jammed, and the other two receivers cannot take over its data.

The 3-node cluster also tends to run slowly: nearly all tasks are scheduled on the node that had the earlier receiver failure (I used union to combine the 3 receivers' DStreams, so I expected the combined DStream to be well distributed across all nodes), batches cannot reliably finish within one batch interval, and stages pile up. The digested log shows entries like the following:

...
5728.399: [GC (Allocation Failure) [PSYoungGen: 6954678K->17088K(6961152K)] 7074614K->138108K(20942336K), 0.0203877 secs] [Times: user=0.20 sys=0.00, real=0.02 secs]
...
5/09/22 13:33:35 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 77219 because its task set is gone (this is likely the result of receiving duplicate task finished status updates)
...

These two types of log entries appear during the execution of some (not all) stages.

My configuration:

# of cores on each node: 64
# of nodes: 3
batch interval: 10 seconds
spark.streaming.receiver.maxRate 120
spark.streaming.blockInterval 160  // set so that 10 seconds is divided into approximately the total number of cores (64), to max out all the nodes: 10s * 1000 / 64 ≈ 160 ms
spark.storage.memoryFraction 0.1  // this one doesn't seem to take effect, since the young gen / old gen ratio is nearly 0.3 instead of 0.1

Does anyone have an idea? I appreciate your patience.

BR,
Todd Leo