You could try this as a blueprint:

Read the data in through Spark Streaming. Iterate over the resulting
DStream and convert each RDD into a DataFrame. Use these DataFrames to
perform whatever processing is required, then save each DataFrame into
your target relational warehouse.
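
To make the blueprint above a bit more concrete, here is a rough PySpark sketch. Treat it as one possible wiring, not a tested loader. The JDBC URL, credentials, table name, socket source, and the `make_batch_writer` helper are all placeholders made up for illustration. It also assumes the stock PostgreSQL JDBC driver is on the classpath and is accepted by your warehouse, which you should verify.

```python
from typing import Callable, Dict

# Hypothetical connection details (assumptions, not from this thread).
JDBC_URL = "jdbc:postgresql://gp-master:5432/warehouse"
JDBC_PROPS = {
    "user": "loader",
    "password": "secret",
    "driver": "org.postgresql.Driver",
}
TARGET_TABLE = "staging.events"


def make_batch_writer(spark, transform: Callable,
                      url: str = JDBC_URL,
                      table: str = TARGET_TABLE,
                      props: Dict[str, str] = JDBC_PROPS):
    """Build a (time, rdd) callback that can be passed to DStream.foreachRDD."""

    def write_batch(batch_time, rdd):
        # Skip empty micro-batches so no JDBC connection is opened for nothing.
        if rdd.isEmpty():
            return
        # Convert the RDD of this micro-batch into a DataFrame.
        df = spark.createDataFrame(rdd)
        # Apply whatever cleaning or processing is required.
        processed = transform(df)
        # Append the result to the target table over JDBC (plain INSERTs).
        processed.write.jdbc(url=url, table=table, mode="append",
                             properties=props)

    return write_batch


def main():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark.sql import Row, SparkSession
    from pyspark.streaming import StreamingContext

    spark = SparkSession.builder.appName("warehouse-loader").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, 5)  # 5-second micro-batches

    # Hypothetical source: comma-separated "id,name" lines on a local socket.
    lines = ssc.socketTextStream("localhost", 9999)
    rows = (lines.map(lambda line: line.split(","))
                 .map(lambda p: Row(id=int(p[0]), name=p[1])))

    # Example processing step: drop duplicate ids within each micro-batch.
    rows.foreachRDD(make_batch_writer(spark, lambda df: df.dropDuplicates(["id"])))

    ssc.start()
    ssc.awaitTermination()
```

You would call `main()` from a script submitted with spark-submit. Note that `mode="append"` only issues INSERTs. Any MERGE-style logic or key lookups would have to happen inside `transform`, or in the warehouse itself after the rows land in a staging table.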

HTH


Tariq, Mohammad
about.me/mti


On Wed, Mar 1, 2017 at 12:27 AM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hi Adaryl,
>
> You could definitely load data into a warehouse using the JDBC support
> in Spark's DataFrames. Could you please explain your use case a bit
> more? That'll help us answer your query better.
>
>
>
>
> Tariq, Mohammad
> about.me/mti
>
>
> On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <
> adaryl.wakefi...@hotmail.com> wrote:
>
>> I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would,
>> of course, have to be in any architecture, but it looks like they are
>> suggesting that Kafka is all you need.
>>
>>
>>
>> My primary concern is the complexity of loading warehouses. I have a web
>> development background so I have somewhat of an idea on how to insert data
>> into a database from an application. I’ve since moved on to straight
>> database programming and don’t work with anything that reads from an app
>> anymore.
>>
>>
>>
>> Loading a warehouse requires a lot of cleaning of data and running and
>> grabbing keys to maintain referential integrity. Usually that’s done in a
>> batch process. Now I have to do it record by record (or a few records). I
>> have some ideas but I’m not quite there yet.
>>
>>
>>
>> I thought Spark SQL would be the way to get this done, but so far all the
>> examples I’ve seen are just SELECT statements, no INSERT or MERGE
>> statements.
>>
>>
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics, LLC
>> 913.938.6685
>>
>> www.massstreet.net
>>
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>
>>
>> *From:* Femi Anthony [mailto:femib...@gmail.com]
>> *Sent:* Tuesday, February 28, 2017 4:13 AM
>> *To:* Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: using spark to load a data warehouse in real time
>>
>>
>>
>> Have you checked to see if there are any drivers that enable you to write
>> to Greenplum directly from Spark?
>>
>>
>>
>> You can also take a look at this link:
>>
>>
>>
>> https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q
>>
>>
>>
>> Apparently GPDB is based on Postgres, so that approach may work.
>>
>> Another approach may be for Spark Streaming to write to Kafka, and then
>> have another process read from Kafka and write to Greenplum.
>>
>>
>>
>> Kafka Connect may be useful in this case -
>>
>>
>>
>> https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
>>
>>
>>
>> Femi Anthony
>>
>>
>>
>>
>>
>>
>> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <
>> adaryl.wakefi...@hotmail.com> wrote:
>>
>> Is anybody using Spark Streaming/SQL to load a relational data warehouse
>> in real time? There isn’t a lot of information on this use case out there.
>> When I google real-time data warehouse loading, nothing I find is up to
>> date. It’s all turn-of-the-century material that doesn’t take into account
>> advancements in database technology. Additionally, whenever I try to learn
>> Spark, it’s always the same thing: play with Twitter data, never structured
>> data. All the CEP use cases are about data science.
>>
>>
>>
>> I’d like to use Spark to load Greenplum in real time. Intuitively, this
>> should be possible. I was thinking Spark Streaming with Spark SQL along
>> with an ORM should do it. Am I off base with this? Are there no examples
>> because there is a better way to do what I want?
>>
>>
>>
>> Adaryl "Bob" Wakefield, MBA
>>
>>
>>
>>
>
