You could try Delta Lake or Apache Hudi for this use case.
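
For example, with Delta Lake you can MERGE each micro-batch's
aggregates into a table instead of checkpointing or rewriting the
whole dataframe; Hudi offers similar upsert semantics. A rough sketch
(the table path and column names below are placeholders, not from
your setup):

    from delta.tables import DeltaTable

    def upsert_batch(spark, batch_df):
        # table holding the latest activity timestamp per user
        target = DeltaTable.forPath(spark, "/data/user_activity")
        (target.alias("t")
               .merge(batch_df.alias("s"), "t.user_id = s.user_id")
               .whenMatchedUpdate(set={"last_activity_ts": "s.last_activity_ts"})
               .whenNotMatchedInsertAll()
               .execute())

Since only the files containing changed rows are rewritten per batch,
there is no lineage build-up to manage.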

On Sat, Jan 9, 2021 at 12:32 PM András Kolbert <kolbertand...@gmail.com>
wrote:

> Sorry if my terminology is misleading.
>
> What I meant by "driver only" is using a local pandas dataframe
> (collecting the data to the driver) and updating that, instead of
> dealing with a distributed Spark dataframe for holding this data.
>
> For example, we have a dataframe with all users and their
> corresponding latest activity timestamps. After each streaming batch,
> aggregations are performed and the result is collected to the driver
> to update the latest activity timestamps for a subset of users.
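>
> Roughly, what I mean is (the column names here are just
> illustrative):
>
>     import pandas as pd
>
>     # local state held on the driver, keyed by user
>     state = pd.DataFrame({"last_activity_ts": []},
>                          index=pd.Index([], name="user_id"))
>
>     def apply_batch(batch_agg_df):
>         global state
>         # collect the small per-batch aggregate (~5k rows) to the driver
>         updates = batch_agg_df.toPandas().set_index("user_id")
>         state.update(updates)  # overwrite existing users in place
>         new_ids = updates.index.difference(state.index)
>         state = pd.concat([state, updates.loc[new_ids]])  # append new users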
>
>
>
> On Sat, 9 Jan 2021, 6:18 pm Artemis User, <arte...@dtechspace.com> wrote:
>
>> Could you please clarify what you mean by 1)? The driver is only
>> responsible for submitting the Spark job, not executing it.
>>
>> -- ND
>>
>> On 1/9/21 9:35 AM, András Kolbert wrote:
>> > Hi,
>> > I would like to get your advice on my use case.
>> > I have a few Spark streaming applications where I need to keep
>> > updating a dataframe after each batch. Each batch typically affects
>> > only a small fraction of the dataframe (around 5k out of 200k records).
>> >
>> > The options I have been considering so far:
>> > 1) keep the dataframe on the driver, and update it after each batch
>> > 2) keep the dataframe distributed, and use checkpointing to truncate
>> > the growing lineage (rough sketch below)
>> >
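>> > A rough sketch of option 2 (column names illustrative):
>> >
>> >     spark.sparkContext.setCheckpointDir("/tmp/state_checkpoints")
>> >
>> >     def merge_batch(state_df, batch_df):
>> >         # keep the state rows not touched by this batch, then add
>> >         # the batch's fresh rows (an upsert on user_id)
>> >         updated = (state_df.join(batch_df, "user_id", "left_anti")
>> >                            .unionByName(batch_df))
>> >         # materialize and cut off the growing lineage
>> >         return updated.checkpoint()
>> >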
>> > I solved previous use cases with option 2, but I am not sure it is
>> > optimal, as checkpointing is relatively expensive. I have also
>> > considered HBase or some sort of fast in-memory store, but it is
>> > currently not in my stack.
>> >
>> > Curious to hear your thoughts
>> >
>> > Andras
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
