Hi all, I have many (>100) jobs running concurrently (sharing the same HiveContext), each appending new rows to the same DataFrame registered as a temp table.
Currently, in each job I am using unionAll and re-registering the result as a temp table. Given an existing DataFrame registered as the temp table "test":

    // Create a DataFrame with the new rows to append
    val newRows = hiveContext.createDataFrame(rows, schema)

    // Retrieve the existing DataFrame and append the new rows via unionAll
    val updatedDF = hiveContext.table("test").unionAll(newRows)

    // Uncache the existing DataFrame
    hiveContext.uncacheTable("test")

    // Register the updated DataFrame as a temp table
    updatedDF.registerTempTable("test")

    // Cache the updated DataFrame
    hiveContext.table("test").cache()

I am finding that this approach can deplete memory very quickly, since each call to .cache() in each job creates a new in-memory entry holding the full table. Does anyone know of a more optimal solution?

Thanks,
Roger
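To illustrate the memory behaviour I am seeing, here is a toy, Spark-free sketch (plain Scala, no HiveContext; the object and method names are made up for the example). Each simulated "job" unions new rows onto the table and then materializes the whole result, standing in for the .cache() call, so the cumulative rows materialized grow quadratically with the number of jobs:

```scala
object UnionCacheSketch {
  // Runs n append "jobs" against a growing table.
  // Returns (final row count, total rows materialized across all cache calls).
  def simulate(n: Int): (Int, Int) = {
    var table: Vector[Int] = Vector.empty
    var totalMaterialized = 0
    for (i <- 1 to n) {
      table = table ++ Vector(i)       // unionAll analogue: builds a full new copy
      totalMaterialized += table.size  // .cache() analogue: materializes the whole table again
    }
    (table.size, totalMaterialized)
  }

  def main(args: Array[String]): Unit = {
    val (rows, cached) = simulate(100)
    // After 100 jobs the table holds 100 rows, but 1+2+...+100 = 5050 rows
    // have been materialized in total across the successive cache calls.
    println(s"rows=$rows cached=$cached")
  }
}
```

This is only an analogue of the pattern, not actual Spark behaviour, but it shows why re-caching the full union after every small append becomes expensive as the table grows.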