Cascading Spark Structured streams

2017-12-28 Thread Eric Dain
I need to write a Spark Structured Streaming pipeline that involves multiple aggregations, splitting data into multiple sub-pipes, and unioning them. It also needs stateful aggregation with a timeout. Spark Structured Streaming supports all of the required functionality, but not in a single stream. I

Ingesting Large csv File to relational database

2017-01-25 Thread Eric Dain
Hi, I need to write a nightly job that ingests large CSV files (~15 GB each) and adds, updates, or deletes the changed rows in a relational database. If a row is identical to what is in the database, I don't want to rewrite the row to the database. Also, if the same item comes from multiple sources (files) I need
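
One possible way to skip identical rows, sketched here and not taken from the thread, is to keep a digest of each stored row and compare it with a digest of the incoming row. The names `row_digest`, `plan_changes`, `existing_digests`, and the `id` key column are hypothetical, and the sketch reads CSV text from a string for brevity, whereas a 15 GB file would be streamed from disk. Reconciling the same item across several source files is not covered, because the preview is cut off before it states that rule.

```python
import csv
import hashlib
import io


def row_digest(row):
    """Stable hash of a row's values, used to detect changed rows cheaply."""
    # Sort column names so the digest does not depend on column order.
    joined = "\x1f".join(row[k] for k in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def plan_changes(csv_text, existing_digests, key_column="id"):
    """Classify incoming rows against digests already stored in the database.

    existing_digests maps primary key -> digest of the row currently stored
    (an assumed bookkeeping structure, e.g. an extra column or side table).
    Returns (inserts, updates, deletes); identical rows appear in none of them.
    """
    inserts, updates, seen = [], [], set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column]
        seen.add(key)
        if key not in existing_digests:
            inserts.append(row)          # new key: add the row
        elif existing_digests[key] != row_digest(row):
            updates.append(row)          # same key, different content: update
        # same key and same digest: unchanged, so nothing is written
    deletes = sorted(set(existing_digests) - seen)  # keys missing from the file
    return inserts, updates, deletes


# Hypothetical data: key 1 is unchanged, 2 changed, 3 disappeared, 4 is new.
incoming = "id,name\n1,apple\n2,pear\n4,plum\n"
stored = {
    "1": row_digest({"id": "1", "name": "apple"}),
    "2": row_digest({"id": "2", "name": "peach"}),
    "3": row_digest({"id": "3", "name": "fig"}),
}
inserts, updates, deletes = plan_changes(incoming, stored)
```

Comparing digests means only changed rows reach the database, at the cost of storing one digest per row. Treating a key that is missing from the file as a delete is an assumption that only holds if each file is a full snapshot.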