Hi, What happens if I dont specify checkpointing on a DStream that has reduceByKeyAndWindow with no inverse function? Would it cause the memory to be overflown? My window sizes are 1 hour and 24 hours. I cannot provide an inserse function for this as it is based on HyperLogLog.
My code looks like something like the following: val logsByPubGeo = messages.map(_._2).filter(_.geo != Constants.UnknownGeo).map { log => val key = PublisherGeoKey(log.publisher, log.geo) val agg = AggregationLog( timestamp = log.timestamp, sumBids = log.bid, imps = 1, uniquesHll = hyperLogLog(log.cookie.getBytes(Charsets.UTF_8)) ) (key, agg) } val aggLogs = logsByPubGeo.reduceByKeyAndWindow(reduceAggregationLogs, BatchDuration) private def reduceAggregationLogs(aggLog1: AggregationLog, aggLog2: AggregationLog) = { aggLog1.copy( timestamp = math.min(aggLog1.timestamp, aggLog2.timestamp), sumBids = aggLog1.sumBids + aggLog2.sumBids, imps = aggLog1.imps + aggLog2.imps, uniquesHll = aggLog1.uniquesHll + aggLog2.uniquesHll ) } Please let me know. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-fro-reduceByKeyAndWindow-with-a-window-size-of-1-hour-and-24-hours-tp28722.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org