Hi Spark users,

In our Spark Streaming app with Kafka integration on Mesos, we initialized 3
receivers to receive 3 Kafka partitions, but a record-receiving-rate imbalance
has been observed: with spark.streaming.receiver.maxRate set to 120, sometimes
one receiver runs very close to the limit while the other two only reach
roughly fifty records per second.
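
For reference, the receiver setup looks roughly like the sketch below (a
minimal sketch assuming the receiver-based KafkaUtils.createStream API; the
ZooKeeper quorum, group id, topic and app name are placeholders, not our
actual values):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-streaming")  // placeholder app name
val ssc  = new StreamingContext(conf, Seconds(10))        // 10 s batch interval

// One receiver per Kafka partition; each receiver is throttled by
// spark.streaming.receiver.maxRate (120 records/s in our case).
val streams = (1 to 3).map { _ =>
  KafkaUtils.createStream(
    ssc,
    "zk1:2181,zk2:2181,zk3:2181",     // placeholder ZooKeeper quorum
    "my-consumer-group",              // placeholder group id
    Map("my-topic" -> 1),             // placeholder topic, 1 consumer thread per receiver
    StorageLevel.MEMORY_AND_DISK_SER)
}

// Combine the 3 receivers' DStreams into one DStream for downstream processing.
val unioned = ssc.union(streams)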

This may be caused by a previous receiver failure, in which one of the
receivers' receiving rate dropped to 0. We restarted the Spark Streaming app,
and the imbalance began. We suspect that the partition consumed by the failing
receiver got jammed, and the other two receivers cannot take over its data.

The 3-node cluster tends to run slowly: nearly all the tasks get scheduled on
the node that had the previous receiver failure (I used union to combine the 3
receivers' DStreams, as in the sketch above, so I expected the combined DStream
to be well distributed across all nodes). Batches cannot be guaranteed to
finish within a single batch interval, stages pile up, and the digested logs
show the following:

...
5728.399: [GC (Allocation Failure) [PSYoungGen: 6954678K->17088K(6961152K)] 7074614K->138108K(20942336K), 0.0203877 secs] [Times: user=0.20 sys=0.00, real=0.02 secs]

...
5/09/22 13:33:35 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 77219 because its task set is gone (this is likely the result of receiving duplicate task finished status updates)

...

These two types of log lines were printed during the execution of some (not all) stages.

My configurations:
# of cores on each node: 64
# of nodes: 3
batch interval: 10 seconds

spark.streaming.receiver.maxRate        120
spark.streaming.blockInterval           160  // chosen so that the 10 s batch
is split into roughly one block per core (64 per node): 10 s * 1000 / 64 ≈ 156
ms, rounded to 160, to max out all the nodes
spark.storage.memoryFraction            0.1  // this one doesn't seem to take
effect, since the young gen / old gen ratio is nearly 0.3 instead of 0.1
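
In case it matters, these are passed in on the SparkConf roughly as below (a
sketch; the app name is a placeholder, and I'm assuming the block interval is
interpreted as plain milliseconds here, per the calculation above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-streaming")                    // placeholder app name
  .set("spark.streaming.receiver.maxRate", "120")   // records per second per receiver
  .set("spark.streaming.blockInterval",    "160")   // milliseconds per block
  .set("spark.storage.memoryFraction",     "0.1")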

Anyone got an idea? I appreciate your patience.

BR,
Todd Leo
