Hi, I'm currently working on the following use case: I have lots of events, each of which has a userId, a createTime, a visitStartDate (initially empty), and many other fields. I would like to use Spark Streaming to tag these events with a visit start date. Two events belong to the same visit if:
1. they have the same userId;
2. the interval between their create times is no longer than 30 minutes;
3. the interval is measured from the last event in the current visit.
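To make the rule concrete, here is a minimal sketch of the gap check (the Event fields and Long millisecond timestamps are illustrative assumptions, not my actual schema):

// Hypothetical event shape; the real events carry many more fields.
case class Event(userId: String, createTime: Long, visitStartDate: Option[Long] = None)

val GapMs = 30 * 60 * 1000L // 30-minute session gap

// An event continues a visit if it is from the same user and was created
// within GapMs of the visit's last event.
def continuesVisit(e: Event, visitUserId: String, lastEventTime: Long): Boolean =
  e.userId == visitUserId && (e.createTime - lastEventTime) <= GapMs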
I have a batch job that does this tagging. It simply maintains a map of active visits: userId -> (visitStartDate, lastEventCreationTime). When an event is analysed, the job checks whether the userId is already known and whether the event belongs to an existing visit. If yes, the event is tagged with the stored visitStartDate; otherwise it is tagged with a new visitStartDate, and that visitStartDate is added to the map together with the userId and the event's createTime.

I would like to have similar behavior in the Spark Streaming job, but I am not sure whether this is a good approach. What should I do in such a case? Any advice? Thanks in advance.
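In case it helps to see what I have in mind, here is a rough translation of that per-user state into Spark Streaming's updateStateByKey. This is only a sketch under assumptions: the Event shape above, events already arriving as a DStream, and checkpointing enabled (which updateStateByKey requires). It maintains the state but does not yet emit the tagged events, which is part of what I am unsure about:

import org.apache.spark.streaming.dstream.DStream

// State kept per userId: (visitStartDate, lastEventCreationTime)
type VisitState = (Long, Long)

def updateVisit(newEvents: Seq[Event], state: Option[VisitState]): Option[VisitState] = {
  if (newEvents.isEmpty) state // no events for this user in this batch: keep the state
  else {
    val sorted = newEvents.sortBy(_.createTime)
    val init = state.getOrElse((sorted.head.createTime, sorted.head.createTime))
    // Fold each event into the running visit, starting a new visit
    // whenever the 30-minute gap is exceeded.
    Some(sorted.foldLeft(init) { case ((visitStart, lastSeen), e) =>
      if (e.createTime - lastSeen <= GapMs) (visitStart, e.createTime)
      else (e.createTime, e.createTime) // gap too large: a new visit starts here
    })
  }
}

// events: DStream[Event], e.g. from Kafka; ssc.checkpoint(...) must be set.
// val visitState = events.map(e => (e.userId, e)).updateStateByKey(updateVisit _)

The open questions for me are how to get the visitStartDate back onto each individual event (a join against the state, or something else), and whether this per-user state will grow too large to maintain over a huge dataset.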