You could try Delta Lake or Apache Hudi for this use case: both support record-level upserts (MERGE) on tables stored as Parquet, so you could merge each micro-batch's ~5k updated rows into a stored table instead of maintaining the dataframe yourself.
On Sat, Jan 9, 2021 at 12:32 PM András Kolbert <kolbertand...@gmail.com> wrote:

> Sorry if my terminology is misleading.
>
> What I meant by driver-only is using a local pandas dataframe (collecting
> the data to the master) and updating that, instead of dealing with a
> distributed Spark dataframe for holding this data.
>
> For example, we have a dataframe with all users and their corresponding
> latest activity timestamps. After each streaming batch, aggregations are
> performed and the result is collected to the driver to update a subset of
> users' latest activity timestamps.
>
> On Sat, 9 Jan 2021, 6:18 pm Artemis User, <arte...@dtechspace.com> wrote:
>
>> Could you please clarify what you mean by 1)? The driver is only
>> responsible for submitting the Spark job, not performing the computation.
>>
>> -- ND
>>
>> On 1/9/21 9:35 AM, András Kolbert wrote:
>> > Hi,
>> > I would like to get your advice on my use case.
>> > I have a few Spark streaming applications where I need to keep
>> > updating a dataframe after each batch. Each batch affects only a
>> > small fraction of the dataframe (about 5k out of 200k records).
>> >
>> > The options I have been considering so far:
>> > 1) keep the dataframe on the driver, and update it after each batch
>> > 2) keep the dataframe distributed, and use checkpointing to mitigate
>> >    lineage growth (see the sketch after this thread)
>> >
>> > I solved previous use cases with option 2, but I am not sure it is
>> > optimal, as checkpointing is relatively expensive. I also wondered
>> > about HBase or some other quick-access storage, but that is currently
>> > not in my stack.
>> >
>> > Curious to hear your thoughts.
>> >
>> > Andras