It's pretty simple, really: run your processing job as often as you want during the week and land the results in a staging table. When loading into the base table, apply a window function partitioned by the primary key(s) and ordered by the updated-time column to keep only the latest version of each row, then delete the existing rows in the base table with those primary keys and insert the deduplicated data. A rough sketch of that flow is below.
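A minimal PySpark sketch of what that could look like. The column names ("id" as the primary key, "updated_at" as the updated-time column), the staging path, and the Teradata JDBC settings are placeholders you would swap for your own, not details from this thread:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weekly-teradata-merge").getOrCreate()

# All rows accumulated during the week; may contain several versions
# of the same primary key (hypothetical staging path).
staged = spark.read.parquet("/staging/weekly_batch")

# Keep only the most recent version of each primary key.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
latest = (staged
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))

# Push the deduplicated batch to a staging table in Teradata over JDBC
# (URL, driver class, and credentials are placeholders).
(latest.write
    .format("jdbc")
    .option("url", "jdbc:teradata://teradata-host/DATABASE=mydb")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "stg_weekly_batch")
    .option("user", "user")
    .option("password", "password")
    .mode("overwrite")
    .save())

# The delete-by-primary-key and insert into the base table would then run
# as SQL on the Teradata side, e.g. from a stored procedure or BTEQ step:
#
#   DELETE FROM base_table
#   WHERE id IN (SELECT id FROM stg_weekly_batch);
#
#   INSERT INTO base_table SELECT * FROM stg_weekly_batch;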
On Tue, Apr 11, 2017 at 2:08 PM, Vamsi Makkena <kv.makk...@gmail.com> wrote:

> Hi Matt,
>
> Thanks for your reply.
>
> I will get updates regularly, but I want to load the updated data once a
> week. A staging table may solve this issue, but I'm looking for how the
> row-updated-time column should be included in the query.
>
> Thanks
>
> On Tue, Apr 11, 2017 at 2:59 PM Matt Deaver <mattrdea...@gmail.com> wrote:
>
>> Do you have updates coming in on your data flow? If so, you will need a
>> staging table and a merge process into your Teradata tables.
>>
>> If you do not have updated rows, i.e. your Teradata tables are
>> append-only, you can process data and insert (bulk load) into Teradata.
>>
>> I don't have experience doing this directly in Spark, though, but
>> according to this post
>> https://community.hortonworks.com/questions/63826/hi-is-there-any-connector-for-teradata-to-sparkwe.html
>> you will need to use a JDBC driver to connect.
>>
>> On Tue, Apr 11, 2017 at 1:23 PM, Vamsi Makkena <kv.makk...@gmail.com>
>> wrote:
>>
>> I am reading the data from Oracle tables and flat files (a new Excel file
>> every week) and writing it to Teradata weekly using Pyspark.
>>
>> In the initial run it will load all the data to Teradata. But in later
>> runs I just want to read the new records from Oracle and the flat files
>> and append them to the Teradata tables.
>>
>> How can I do this using Pyspark, without touching the Oracle and Teradata
>> tables?
>>
>> Please post the sample code if possible.
>>
>> Thanks
>>
>>
>> --
>> Regards,
>>
>> Matt
>> Data Engineer
>> https://www.linkedin.com/in/mdeaver
>> http://mattdeav.pythonanywhere.com/

--
Regards,

Matt
Data Engineer
https://www.linkedin.com/in/mdeaver
http://mattdeav.pythonanywhere.com/