Hi,

I'm currently working on the following use case:
I have lots of events, each of which has a userId, a createTime, a
visitStartDate (initially empty) and many other fields. I would like to
use Spark Streaming to tag those events with a visit start date. Two events
belong to the same visit if:
1. they have the same userId;
2. the interval between their create times is no longer than 30 minutes;
3. the interval is measured from the last event in the current visit.
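
In other words, the rule I apply is roughly this (assuming createTime is
in epoch milliseconds):

    // an event extends the current visit iff it falls within 30
    // minutes of the visit's last event (rules 2 and 3)
    val sessionGapMs = 30 * 60 * 1000L

    def extendsVisit(lastEventCreateTime: Long, createTime: Long): Boolean =
      createTime - lastEventCreateTime <= sessionGapMs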

I have a batch job which does this. It simply maintains a map of active
visits: userId -> (visitStartDate, lastEventCreateTime). When an event is
analysed, the job checks whether its userId is already known and whether it
belongs to an existing visit. If yes, the event is tagged with the stored
visitStartDate; otherwise it is tagged with a new visitStartDate, and this
new visitStartDate is put into the map together with the userId and the
event's createTime.
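
Roughly, the batch logic looks like this (Event and the field names are
simplified placeholders; times are epoch milliseconds, and events are
assumed to be processed in createTime order):

    import scala.collection.mutable

    case class Event(userId: String, createTime: Long,
                     visitStartDate: Option[Long])

    val sessionGapMs = 30 * 60 * 1000L

    // userId -> (visitStartDate, lastEventCreateTime)
    val activeVisits = mutable.Map.empty[String, (Long, Long)]

    def tag(e: Event): Event = activeVisits.get(e.userId) match {
      // known user within 30 minutes of the visit's last event:
      // reuse the visitStartDate and advance lastEventCreateTime
      case Some((start, last)) if e.createTime - last <= sessionGapMs =>
        activeVisits(e.userId) = (start, e.createTime)
        e.copy(visitStartDate = Some(start))
      // unknown user or expired visit: this event starts a new visit
      case _ =>
        activeVisits(e.userId) = (e.createTime, e.createTime)
        e.copy(visitStartDate = Some(e.createTime))
    }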

I would like to have similar behaviour in a Spark Streaming job, but I am
not sure whether maintaining such state is a good approach there. I would
like to know what to do in such a case. Any advice?
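
For reference, this is roughly what I had in mind, based on
updateStateByKey (an untested sketch; here `events` is a
DStream[(String, Long)] of (userId, createTime), and updateStateByKey
requires a checkpoint directory to be set on the StreamingContext):

    import org.apache.spark.streaming.dstream.DStream

    val sessionGapMs = 30 * 60 * 1000L

    // per-user state: (visitStartDate, lastEventCreateTime)
    def tagVisits(events: DStream[(String, Long)])
        : DStream[(String, (Long, Long))] =
      events.updateStateByKey[(Long, Long)] {
        (createTimes: Seq[Long], state: Option[(Long, Long)]) =>
          createTimes.sorted.foldLeft(state) {
            // within 30 minutes of the last event: extend the visit
            case (Some((start, last)), t) if t - last <= sessionGapMs =>
              Some((start, t))
            // no state yet, or gap exceeded: start a new visit at t
            case (_, t) => Some((t, t))
          }
      }

What worries me is that the state never shrinks here: entries for inactive
users stay in the map forever, which is exactly the "huge dataset" part I
am not sure how to handle.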
Thanks in advance.