Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
Hi Vijay:
I am using spark-shell because I am still prototyping the steps involved.
Regarding executors - I have 280 executors, and the UI only shows a few straggler 
tasks on each trigger.  The UI does not show much time spent on GC, so I 
suspect the delay is in fetching data from Kafka. The number of stragglers is 
generally fewer than 5 out of 240, but it is sometimes higher. 

I will try to dig more into it and see if changing partitions etc. helps, but I was 
wondering if anyone else has encountered similar stragglers holding up 
processing of a window trigger.
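One way to dig into per-trigger timing from the driver is the streaming query 
progress API; below is a rough sketch, where query is assumed to be the handle 
returned by writeStream.start():

// Sketch: inspect where each trigger spends its time; query is the
// StreamingQuery handle returned by writeStream.start().
val p = query.lastProgress
if (p != null) {
  println(p.durationMs)          // triggerExecution, getBatch, addBatch, ... (ms)
  p.sources.foreach(s => println(s.description + " rows=" + s.numInputRows))
}
query.recentProgress.foreach(pr => println(s"batch=${pr.batchId} ${pr.durationMs}"))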
Thanks
 

On Friday, February 23, 2018 6:07 PM, vijay.bvp  wrote:
 

Instead of spark-shell, have you tried running it as a job?

How many executors and cores are you using? Can you share the RDD graph and the
event timeline from the UI? Did you find which of the tasks take more time, and
was there any GC?

Please look at the UI if you haven't already; it can provide a lot of information.




Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread vijay.bvp
Instead of spark-shell, have you tried running it as a job?

How many executors and cores are you using? Can you share the RDD graph and the
event timeline from the UI? Did you find which of the tasks take more time, and
was there any GC?

Please look at the UI if you haven't already; it can provide a lot of information.
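If the event timeline is hard to read at this scale, a SparkListener can also flag
slow tasks from the driver. A rough sketch (spark is the session available in
spark-shell; the thresholds are arbitrary examples):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch only: print tasks whose run time or GC time is unusually high,
// as a complement to the stage event timeline in the UI.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && (m.executorRunTime > 60000 || m.jvmGCTime > 5000)) {
      println(s"straggler: stage=${taskEnd.stageId} runMs=${m.executorRunTime} gcMs=${m.jvmGCTime}")
    }
  }
})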






Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi:
I am working with Spark Structured Streaming (2.2.1), reading data from Kafka 
(0.11).  

I need to aggregate the data ingested every minute, and I am using spark-shell at 
the moment.  The message ingestion rate is approx 500k/second.  During some 
trigger intervals (1 minute), especially when the streaming process has just 
started, all tasks finish in 20 seconds, but during other triggers it takes 90 
seconds.  

I have tried reducing the number of partitions (from approx 300 to 100) to reduce 
the number of Kafka consumers, but that has not helped. I also tried setting 
kafkaConsumer.pollTimeoutMs to 30 seconds, but then I see a lot of 
java.util.concurrent.TimeoutException: Cannot fetch record for offset.
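For context, the query is roughly of this shape; this is a simplified sketch, and 
the broker, topic, and option values below are illustrative placeholders rather 
than my actual settings:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Simplified sketch; broker, topic, and durations are placeholders.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("kafkaConsumer.pollTimeoutMs", 30000L)   // the poll timeout mentioned above
  .load()

val counts = kafkaDf
  .select($"timestamp", $"value".cast("string"))
  .withWatermark("timestamp", "2 minutes")
  .groupBy(window($"timestamp", "1 minute"))
  .count()

val query = counts.writeStream
  .outputMode("update")
  .format("console")                               // placeholder sink
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()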
So I wanted to see if anyone has any thoughts/recommendations.
Thanks