Hi, I need to write a nightly job that ingests large CSV files (~15 GB each) and applies the changed rows (inserts, updates, and deletes) to a relational database.
If a row is identical to what is already in the database, I don't want to rewrite it. Also, if the same item arrives from multiple sources (files), I need logic to decide whether the new source is preferred or the row currently in the database should be kept unchanged. Obviously, I don't want to query the database for each item to check whether it has changed; I would prefer to maintain that state inside Spark. Is there a preferred, performant way to do this with Apache Spark?

Best,
Eric
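
P.S. To make the intended behaviour concrete, here is a rough plain-Python sketch (not Spark code) of the per-key decision I have in mind. The names `SOURCE_PRIORITY`, `row_hash`, and `plan_changes`, and the source labels `vendor_a` and `vendor_b`, are made up for illustration. Treating a key that is missing from all of tonight's files as a delete is also just an assumption on my part.

```python
import hashlib

# Hypothetical ranking: a lower number means a more preferred source.
SOURCE_PRIORITY = {"vendor_a": 0, "vendor_b": 1}


def row_hash(values):
    """Stable fingerprint of a row's payload, used to spot unchanged rows."""
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def plan_changes(state, incoming):
    """Classify tonight's rows against the state saved by the previous run.

    state:    {key: (source, hash)} kept outside the database
    incoming: iterable of (key, source, values) read from tonight's files
    Returns (inserts, updates, deletes, new_state).
    """
    # Step 1: per key, keep only the row from the most preferred source.
    best = {}
    for key, source, values in incoming:
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[source] < SOURCE_PRIORITY[current[0]]:
            best[key] = (source, values)

    # Step 2: compare each surviving row with the saved state.
    inserts, updates, new_state = {}, {}, {}
    for key, (source, values) in best.items():
        fingerprint = row_hash(values)
        old = state.get(key)
        if old is None:
            inserts[key] = values                  # key never seen before
            new_state[key] = (source, fingerprint)
        elif SOURCE_PRIORITY[source] > SOURCE_PRIORITY[old[0]]:
            new_state[key] = old                   # stored source is preferred: keep as is
        elif old[1] != fingerprint:
            updates[key] = values                  # payload changed: write it
            new_state[key] = (source, fingerprint)
        else:
            new_state[key] = (source, fingerprint)  # identical payload: no write

    # Step 3 (assumption): keys absent from every file tonight are deletes.
    deletes = [key for key in state if key not in best]
    return inserts, updates, deletes, new_state
```

In Spark terms I imagine `state` would be a dataset of keys, sources, and hashes persisted between runs and joined against the new files, but I don't know whether that is the recommended approach.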