Hi all, Anyone could help me on this. It's a bit urgent for me on this. I'm very confused and curious about Spark data checkpoint performance? Is there any detail implementation of checkpoint I can look into? Spark Streaming only take sub-second to process 20K messages/sec, however it take 25 seconds for checkpoint. Now my application have average 30 seconds latency and keep increasingly.
2015-11-06 11:11 GMT+07:00 Thúy Hằng Lê <thuyhang...@gmail.com>: > Thankd all, it would be great to have this feature soon. > Do you know what's the release plan for 1.6? > > In addition to this, I still have checkpoint performance problem > > My code is just simple like this: > JavaStreamingContext jssc = new > JavaStreamingContext(sparkConf,Durations.seconds(2)); > jssc.checkpoint("spark-data/checkpoint"); > JavaPairInputDStream<String, String> messages = > KafkaUtils.createDirectStream(...); > JavaPairDStream<String, List<Double>> stats = > messages.mapToPair(parseJson) > .reduceByKey(REDUCE_STATS) > .updateStateByKey(RUNNING_STATS); > > stats.print() > > Now I need to maintain about 800k keys, the stats here is only count > number of occurence for key. > While running the cache dir is very small (about 50M), my question is: > > 1/ For regular micro-batch it takes about 800ms to finish, but every 10 > seconds when data checkpoint is running > It took me 5 seconds to finish the same size micro-batch, why it's too > high? what's kind of job in checkpoint? > why it's keep increasing? > > 2/ When I changes the data checkpoint interval like using: > stats.checkpoint(Durations.seconds(100)); //change to 100, defaults > is 10 > > The checkpoint is keep increasing significantly first checkpoint is 10s, > second is 30s, third is 70s ... and keep increasing :) > Why it's too high when increasing checkpoint interval? > > It seems that default interval works more stable. > > On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote: > >> Nice! Thanks for sharing, I wasn’t aware of the new API. >> >> Left some comments on the JIRA and design doc. >> >> -adrian >> >> From: Shixiong Zhu >> Date: Tuesday, November 3, 2015 at 3:32 AM >> To: Thúy Hằng Lê >> Cc: Adrian Tanase, "user@spark.apache.org" >> Subject: Re: Spark Streaming data checkpoint performance >> >> "trackStateByKey" is about to be added in 1.6 to resolve the performance >> issue of "updateStateByKey". You can take a look at >> https://issues.apache.org/jira/browse/SPARK-2629 and >> https://github.com/apache/spark/pull/9256 >> >