[PySpark] Understanding the times reported by PythonRunner

2019-11-20 Thread Valerie Hayot
Hi, I'm investigating the performance of my application and am trying to gain a better understanding of what the boot, init and finish times reported by the PythonRunner signify. This is an example of output I have obtained: 19/10/26 12:14:24 INFO PythonRunner: Times: total = 35412, boot =
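For readers hitting the same log line: as I understand the PythonRunner source, boot is the time to launch and connect to the Python worker process, init is the time the worker spends reading the serialized function and setting itself up, and finish is the time spent actually processing the partition's data, with total spanning all three. Below is a minimal sketch (app name illustrative) that forces data through a Python worker and therefore produces these timing lines:

```python
# Any pure-Python transformation (lambda, UDF) runs in a Python worker
# rather than the JVM, so each task emits an INFO line of the form
# "PythonRunner: Times: total = ..., boot = ..., init = ..., finish = ..."
# in the executor log (the driver console in local mode).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pythonrunner-times-demo").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("INFO")  # the PythonRunner timing lines are logged at INFO

counts = (sc.parallelize(range(1_000_000), numSlices=4)
            .map(lambda x: x % 10)   # executed inside the Python worker
            .countByValue())
print(counts)
spark.stop()
```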

Spark onApplicationEnd run multiple times during the application failure

2019-11-20 Thread Jiang, Yi J (CWM-NR)
Hello, we are running into an issue. We have customized the SparkListener class and added it to the Spark context, but when the Spark job fails, we find that the "onApplicationEnd" function is triggered twice. Isn't it supposed to be triggered only once when the job fails? Because the

Re: Structured Streaming Kafka change maxOffsetsPerTrigger won't apply

2019-11-20 Thread Gabor Somogyi
Hi Roland, not much was shared apart from the fact that it's not working. The latest partition offset is used when the size of a TopicPartition is negative. You can find this out by checking for the following entry in the logs: logDebug(s"rateLimit $tp size is $size") If you've double-checked and still think it's an
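Seeing that logDebug line requires DEBUG logging to be enabled for the Kafka source. A quick, if noisy, sketch in PySpark (the log4j logger name in the comment is an assumption and varies across Spark versions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rate-limit-debug").getOrCreate()

# Blunt approach: raise everything to DEBUG, then grep the driver and
# executor logs for "rateLimit" while the query runs. A quieter
# alternative is to scope the level in conf/log4j.properties, e.g.
# (logger name assumed, version-dependent):
#   log4j.logger.org.apache.spark.sql.kafka010=DEBUG
spark.sparkContext.setLogLevel("DEBUG")
```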

Structured Streaming Kafka change maxOffsetsPerTrigger won't apply

2019-11-20 Thread Roland Johann
Hi all, changing maxOffsetsPerTrigger and restarting the job does not apply the new batch size. This is problematic, as we currently use a trigger duration of 5 minutes, which consumes only 100k messages against an offset lag in the billions. Decreasing the trigger duration also affects the micro-batch size -
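For reference, maxOffsetsPerTrigger is a read option rather than checkpointed state, so it should be re-read when the query restarts; one possible explanation for the behavior above (an assumption on my part, not confirmed in this thread) is that a micro-batch already planned in the checkpoint's offset log is re-executed with its old offset range before a new cap takes effect. A sketch of where the option is set (broker, topic, and checkpoint path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-rate-limit").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", 500000)  # cap records per micro-batch
      .load())

query = (df.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/kafka-rate-limit")
         .trigger(processingTime="5 minutes")
         .start())
query.awaitTermination()
```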