Hi,

I need to write a nightly job that ingests large CSV files (~15 GB each) and
applies the added, updated, and deleted rows to a relational database.

If a row is identical to what is in the database, I don't want to rewrite the
row. Also, if the same item comes from multiple sources (files), I need to
implement logic to choose whether the new source is preferred or the current
row in the database should be kept unchanged.

Obviously, I don't want to query the database for each item to check whether
it has changed. I would prefer to maintain that state inside Spark.
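
To make the semantics concrete, here is a rough sketch of the comparison I have
in mind. It is plain Python over in-memory dictionaries, not Spark code, and
everything in it is an assumption for illustration: the source names and their
ranking in SOURCE_PRIORITY, the helper names row_hash and plan_changes, the
idea of keeping one (source, hash) pair per key as the state, and the premise
that each night's files are a full snapshot (so a key missing from them counts
as a delete).

```python
import hashlib

# Hypothetical source ranking: a lower number means a more preferred source.
SOURCE_PRIORITY = {"vendor_a": 0, "vendor_b": 1}


def row_hash(values):
    """Stable fingerprint of a row's payload columns."""
    joined = "\x1f".join("" if v is None else str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def plan_changes(state, incoming):
    """Decide which keys need to be written to the database.

    state:    {key: (source, hash)} describing what the database holds now
    incoming: list of (key, source, values) rows from tonight's files
    Returns (inserts, updates, deletes) as sorted lists of keys.
    """
    # Resolve duplicates across files: keep the most preferred source per key.
    best = {}
    for key, source, values in incoming:
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[source] < SOURCE_PRIORITY[current[0]]:
            best[key] = (source, values)

    inserts, updates = [], []
    for key, (source, values) in best.items():
        if key not in state:
            inserts.append(key)
            continue
        old_source, old_hash = state[key]
        # Keep the stored row if it came from a more preferred source.
        if SOURCE_PRIORITY[source] > SOURCE_PRIORITY[old_source]:
            continue
        # Identical rows are skipped, so nothing is rewritten.
        if row_hash(values) != old_hash:
            updates.append(key)

    # Assumes full snapshots: keys absent from tonight's files are deletes.
    deletes = [key for key in state if key not in best]
    return sorted(inserts), sorted(updates), sorted(deletes)
```

What I am looking for is the Spark equivalent of this: keeping the
(key, source, hash) state between runs and producing only the three change
sets, without a database lookup per row.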

Is there a preferred and performant way to do that using Apache Spark?

Best,
Eric
