Have you taken a look at the join section in the streaming programming guide?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins

On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior <rendy.b.jun...@gmail.com> wrote:

> Let say I have transaction data and visit data
>
> visit
> | userId | Visit source | Timestamp |
> | A | google ads | 1 |
> | A | facebook ads | 2 |
>
> transaction
> | userId | total price | timestamp |
> | A | 100 | 248384 |
> | B | 200 | 43298739 |
>
> I want to join transaction data and visit data to do sales attribution. I
> want to do it realtime whenever transaction occurs (streaming).
>
> Is it scalable to do join between one data and very big historical data
> using join function in spark? If it is not, then how it usually be done?
>
> Visit needs to be historical, since visit can be anytime before
> transaction (e.g. visit is one year before transaction occurs)
>
> Rendy
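
To make the idea concrete, here is a minimal plain-Python sketch (not Spark code) of what a keyed join between incoming transactions and a historical visit dataset computes. It assumes "last-touch" attribution, i.e. each transaction is credited to the user's most recent visit at or before the transaction timestamp; that rule, the function names `index_visits` and `attribute`, and the tuple layout are illustrative assumptions, not something taken from the question or from the Spark API. In Spark Streaming the equivalent step would be a join keyed on userId between each batch and the visit dataset, as described in the guide section linked above.

```python
from bisect import bisect_right
from collections import defaultdict

# Sample rows shaped like the tables in the question:
# visits are (userId, source, timestamp), transactions are (userId, total, timestamp).
visits = [
    ("A", "google ads", 1),
    ("A", "facebook ads", 2),
]
transactions = [
    ("A", 100, 248384),
    ("B", 200, 43298739),
]


def index_visits(visits):
    """Group visits by userId and sort each group by timestamp (hypothetical helper)."""
    by_user = defaultdict(list)
    for user_id, source, ts in visits:
        by_user[user_id].append((ts, source))
    for rows in by_user.values():
        rows.sort()
    return by_user


def attribute(transaction, visit_index):
    """Return the source of the latest visit at or before the transaction, else None.

    Assumes last-touch attribution; other attribution rules would change this function.
    """
    user_id, _total, ts = transaction
    rows = visit_index.get(user_id, [])
    timestamps = [t for t, _source in rows]
    pos = bisect_right(timestamps, ts)  # number of visits with timestamp <= ts
    return rows[pos - 1][1] if pos else None


visit_index = index_visits(visits)
attributed = [(t[0], t[1], attribute(t, visit_index)) for t in transactions]
print(attributed)
```

The point of the sketch is that each transaction only needs the visits for its own userId, so the historical side can be partitioned (or stored in an external key-value store) by userId rather than scanned in full for every batch.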