Hi Adaryl,

You can definitely load data into a warehouse through Spark's JDBC support, using DataFrames. Could you explain your use case a bit more? That will help us answer your question better.
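[A minimal sketch of the JDBC route Tariq describes. The host, database, table, and credentials are illustrative placeholders; the actual write needs a live SparkSession and the Postgres JDBC driver jar on the classpath (Greenplum speaks the Postgres wire protocol, so the `org.postgresql` driver is commonly used).]

```python
def build_jdbc_url(host, port, database):
    """Build a PostgreSQL-compatible JDBC URL for a warehouse.
    Pure Python, so it can be checked without a cluster."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def write_to_warehouse(df, table, url, user, password):
    """Append a Spark DataFrame to a warehouse table over JDBC.
    Only runnable inside a Spark job with the JDBC driver available."""
    (df.write
       .format("jdbc")
       .option("url", url)
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .mode("append")     # append micro-batches; overwrite would truncate
       .save())


# Hypothetical Greenplum master host, for illustration only:
url = build_jdbc_url("gp-master.example.com", 5432, "warehouse")
```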
Tariq, Mohammad
about.me/mti

On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:

> I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would,
> of course, have to be in any architecture, but it looks like they are
> suggesting that Kafka is all you need.
>
> My primary concern is the complexity of loading warehouses. I have a web
> development background, so I have somewhat of an idea of how to insert data
> into a database from an application. I’ve since moved on to straight
> database programming and don’t work with anything that reads from an app
> anymore.
>
> Loading a warehouse requires a lot of cleaning of data, and looking up and
> grabbing keys to maintain referential integrity. Usually that’s done in a
> batch process. Now I have to do it record by record (or a few records at a
> time). I have some ideas, but I’m not quite there yet.
>
> I thought Spark SQL would be the way to get this done, but so far all the
> examples I’ve seen are just SELECT statements, no INSERT or MERGE
> statements.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
> *From:* Femi Anthony [mailto:femib...@gmail.com]
> *Sent:* Tuesday, February 28, 2017 4:13 AM
> *To:* Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: using spark to load a data warehouse in real time
>
> Have you checked to see if there are any drivers that enable you to write
> to Greenplum directly from Spark?
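[The referential-integrity concern above usually reduces to an upsert: insert new rows, update rows whose key already exists. One hedged sketch, since GPDB is Postgres-based, is to generate `INSERT ... ON CONFLICT` statements and execute them per micro-batch from `foreachPartition`. Caveat: `ON CONFLICT` requires PostgreSQL 9.5+; Greenplum releases based on older Postgres would need a staging-table merge instead. Table and column names below are hypothetical.]

```python
def build_upsert_sql(table, columns, key_columns):
    """Build a parameterized upsert statement for a PostgreSQL-family
    warehouse. Non-key columns are refreshed from the incoming row via
    EXCLUDED; key columns decide whether the row is new."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))  # psycopg2-style params
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in key_columns
    )
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_columns)}) DO UPDATE SET {updates}"
    )


# Hypothetical dimension table keyed on customer_id:
sql = build_upsert_sql(
    "dim_customer", ["customer_id", "name", "email"], ["customer_id"]
)
```

Each executor partition would then open a database connection and run `sql` once per record (or batched with `executemany`), which keeps the record-by-record maintenance inside the micro-batch rather than a nightly batch job.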
> You can also take a look at this link:
>
> https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q
>
> Apparently GPDB is based on Postgres, so that approach may work.
>
> Another approach may be for Spark Streaming to write to Kafka, and then
> have another process read from Kafka and write to Greenplum.
>
> Kafka Connect may be useful in this case:
>
> https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
>
> Femi Anthony
>
> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:
>
> Is anybody using Spark Streaming/SQL to load a relational data warehouse
> in real time? There isn’t a lot of information on this use case out there.
> When I google “real time data warehouse load,” nothing I find is up to
> date. It’s all turn-of-the-century stuff that doesn’t take into account
> advancements in database technology. Additionally, whenever I try to learn
> Spark, it’s always the same thing: playing with Twitter data, never
> structured data. All the CEP use cases are about data science.
>
> I’d like to use Spark to load Greenplum in real time. Intuitively, this
> should be possible. I was thinking Spark Streaming with Spark SQL, along
> with an ORM, should do it. Am I off base with this? Is the reason there
> are no examples that there is a better way to do what I want?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
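[The pipeline the thread converges on, sketched under stated assumptions: Spark Streaming consumes events, and each micro-batch is turned into a DataFrame and appended to the warehouse over JDBC. The socket source, schema, and table name are illustrative; the Kafka Connect route Femi mentions avoids this code entirely by configuring a JDBC sink connector instead.]

```python
def parse_event(line):
    """Parse one comma-delimited event into an (id, amount) tuple.
    Kept pure Python so it can be unit-tested without a cluster."""
    ident, amount = line.split(",")
    return (int(ident), float(amount))


def start_stream(ssc, jdbc_url, props):
    """Wire a DStream to the warehouse. Needs a running
    StreamingContext and the Postgres JDBC driver; illustrative only."""
    from pyspark.sql import SparkSession  # requires pyspark at runtime

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

    def write_batch(rdd):
        if rdd.isEmpty():
            return  # skip empty micro-batches
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(rdd.map(parse_event), ["id", "amount"])
        # Append this micro-batch to a hypothetical fact table:
        df.write.jdbc(jdbc_url, "fact_events", mode="append",
                      properties=props)

    lines.foreachRDD(write_batch)
```

So the answer to the original question is roughly: yes, Spark Streaming plus the DataFrame JDBC writer can load a Postgres-compatible warehouse in near real time; an ORM is not needed, and per-batch upsert logic handles the key maintenance.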