Hello. I have a Python process that reads from a Kafka topic and, for each record, checks whether its id exists in a table.
```python
# Load table in memory
table = sqlContext.sql("select id from table")
table.cache()

def processForeach(time, rdd):
    print(time)
    for k in rdd.collect():
        if table.filter("id = '%s'" % k["id"]).count() > 0:
            print(k)

kafkaTopic.foreachRDD(processForeach)
```

The problem is that, little by little, Spark falls behind real time: I can see it in the `print(time)` output, even though the Kafka topic receives at most 3 messages per second.
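For context, here is a minimal pure-Python sketch (the names are illustrative, not taken from my Spark code) of what I am effectively doing per record. I am wondering whether collecting the ids into a local set once, instead of running a Spark filter job for every message, is the right direction:

```python
# Hypothetical stand-in for table.select("id").collect(): the ids are
# pulled into driver memory once, so each Kafka record costs an O(1)
# set lookup instead of a full Spark job.
known_ids = {"a1", "b2", "c3"}

def process_record(record):
    # record is a dict like the ones yielded by rdd.collect()
    return record["id"] in known_ids

print(process_record({"id": "b2"}))  # prints True
print(process_record({"id": "zz"}))  # prints False
```

This only works if the id column fits in driver memory and the table does not change during the streaming job, which may not hold in my case.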