Hi all,

We are having performance issues with the updateStateByKey operation in Spark Streaming (1.2.1 at the moment) and any advice would be greatly appreciated.

Specifically, on each tick of the system (the batch interval is set to 10 seconds) we need to update a state tuple where the key is the user_id and the value is an object holding some state about that user. The problem is that with Kryo serialization and 5M users this gets so slow that we have to increase the interval beyond 10 seconds to avoid falling behind.

The input to the streaming job is a Kafka stream consisting of key-value pairs of user_ids and action codes; we join this against our checkpointed state by key and update the state.

I understand that the reason for iterating over the whole state set is to support evicting items or updating everyone's state for time-dependent computations, but neither applies in our situation, and it hurts performance badly. Is there any possibility of adding an extra call to the API in the future that updates only a specific subset of keys?
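For context, here is a minimal sketch of the kind of per-key update function we pass to updateStateByKey (the signature `(Seq[V], Option[S]) => Option[S]` is Spark's; `UserState` and the Int action codes are simplified placeholders for our real per-user state). The point is that Spark invokes this function for every key currently in the state on every batch, even keys with no new events, which is exactly the full-scan cost described above:

```scala
// Hypothetical, simplified per-user state; our real object is larger.
case class UserState(lastAction: Int, count: Long)

// Update function with the shape updateStateByKey expects:
// (new values for this key in this batch, previous state) => new state.
// Returning None would drop the key from the state; we never do that here.
def updateUser(newActions: Seq[Int], state: Option[UserState]): Option[UserState] = {
  if (newActions.isEmpty) {
    // No new data for this user -- but Spark still called us for the key,
    // so we pay (de)serialization cost just to pass the state through.
    state
  } else {
    val prev = state.getOrElse(UserState(-1, 0L))
    Some(UserState(newActions.last, prev.count + newActions.size))
  }
}
```

In the actual job this would be wired up roughly as `kafkaStream.updateStateByKey(updateUser _)`; a subset-of-keys API would let us skip the no-new-data branch entirely.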
p.s. I will try setting the DStream to non-serialized storage as soon as possible, but then I am worried about GC and checkpointing performance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/updateStateByKey-performance-API-tp22113.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.