Hi Adaryl,

You can definitely load data into a warehouse through Spark's JDBC support, using DataFrames. Could you explain your use case a bit more? That will help us answer your question better.
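[A minimal sketch of the JDBC route Tariq describes. The host, database, table, and credentials are illustrative placeholders; the actual write needs a live SparkSession and the Postgres JDBC driver jar on the classpath (Greenplum speaks the Postgres wire protocol, so the `org.postgresql` driver is commonly used).]

```python
def build_jdbc_url(host, port, database):
    """Build a PostgreSQL-compatible JDBC URL for a warehouse.
    Pure Python, so it can be checked without a cluster."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def write_to_warehouse(df, table, url, user, password):
    """Append a Spark DataFrame to a warehouse table over JDBC.
    Only runnable inside a Spark job with the JDBC driver available."""
    (df.write
       .format("jdbc")
       .option("url", url)
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .mode("append")     # append micro-batches; overwrite would truncate
       .save())


# Hypothetical Greenplum master host, for illustration only:
url = build_jdbc_url("gp-master.example.com", 5432, "warehouse")
```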
Tariq, Mohammad
about.me/mti

On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:

> I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would,
> of course, have to be in any architecture, but it looks like they are
> suggesting that Kafka is all you need.
>
> My primary concern is the complexity of loading warehouses. I have a web
> development background, so I have somewhat of an idea of how to insert data
> into a database from an application. I’ve since moved on to straight
> database programming and don’t work with anything that reads from an app
> anymore.
>
> Loading a warehouse requires a lot of cleaning of data, and looking up and
> grabbing keys to maintain referential integrity. Usually that’s done in a
> batch process. Now I have to do it record by record (or a few records at a
> time). I have some ideas, but I’m not quite there yet.
>
> I thought Spark SQL would be the way to get this done, but so far all the
> examples I’ve seen are just SELECT statements, no INSERT or MERGE
> statements.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
> *From:* Femi Anthony [mailto:femib...@gmail.com]
> *Sent:* Tuesday, February 28, 2017 4:13 AM
> *To:* Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: using spark to load a data warehouse in real time
>
> Have you checked to see if there are any drivers that enable you to write
> to Greenplum directly from Spark?
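[The referential-integrity concern above usually reduces to an upsert: insert new rows, update rows whose key already exists. One hedged sketch, since GPDB is Postgres-based, is to generate `INSERT ... ON CONFLICT` statements and execute them per micro-batch from `foreachPartition`. Caveat: `ON CONFLICT` requires PostgreSQL 9.5+; Greenplum releases based on older Postgres would need a staging-table merge instead. Table and column names below are hypothetical.]

```python
def build_upsert_sql(table, columns, key_columns):
    """Build a parameterized upsert statement for a PostgreSQL-family
    warehouse. Non-key columns are refreshed from the incoming row via
    EXCLUDED; key columns decide whether the row is new."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))  # psycopg2-style params
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in key_columns
    )
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_columns)}) DO UPDATE SET {updates}"
    )


# Hypothetical dimension table keyed on customer_id:
sql = build_upsert_sql(
    "dim_customer", ["customer_id", "name", "email"], ["customer_id"]
)
```

Each executor partition would then open a database connection and run `sql` once per record (or batched with `executemany`), which keeps the record-by-record maintenance inside the micro-batch rather than a nightly batch job.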
> You can also take a look at this link:
>
> https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q
>
> Apparently GPDB is based on Postgres, so that approach may work.
>
> Another approach may be for Spark Streaming to write to Kafka, and then
> have another process read from Kafka and write to Greenplum.
>
> Kafka Connect may be useful in this case:
>
> https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
>
> Femi Anthony
>
> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:
>
> Is anybody using Spark Streaming/SQL to load a relational data warehouse
> in real time? There isn’t a lot of information on this use case out there.
> When I google “real time data warehouse load,” nothing I find is up to
> date. It’s all turn-of-the-century stuff that doesn’t take into account
> advancements in database technology. Additionally, whenever I try to learn
> Spark, it’s always the same thing: playing with Twitter data, never
> structured data. All the CEP use cases are about data science.
>
> I’d like to use Spark to load Greenplum in real time. Intuitively, this
> should be possible. I was thinking Spark Streaming with Spark SQL, along
> with an ORM, should do it. Am I off base with this? Is the reason there
> are no examples that there is a better way to do what I want?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
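[The pipeline the thread converges on, sketched under stated assumptions: Spark Streaming consumes events, and each micro-batch is turned into a DataFrame and appended to the warehouse over JDBC. The socket source, schema, and table name are illustrative; the Kafka Connect route Femi mentions avoids this code entirely by configuring a JDBC sink connector instead.]

```python
def parse_event(line):
    """Parse one comma-delimited event into an (id, amount) tuple.
    Kept pure Python so it can be unit-tested without a cluster."""
    ident, amount = line.split(",")
    return (int(ident), float(amount))


def start_stream(ssc, jdbc_url, props):
    """Wire a DStream to the warehouse. Needs a running
    StreamingContext and the Postgres JDBC driver; illustrative only."""
    from pyspark.sql import SparkSession  # requires pyspark at runtime

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

    def write_batch(rdd):
        if rdd.isEmpty():
            return  # skip empty micro-batches
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(rdd.map(parse_event), ["id", "amount"])
        # Append this micro-batch to a hypothetical fact table:
        df.write.jdbc(jdbc_url, "fact_events", mode="append",
                      properties=props)

    lines.foreachRDD(write_batch)
```

So the answer to the original question is roughly: yes, Spark Streaming plus the DataFrame JDBC writer can load a Postgres-compatible warehouse in near real time; an ORM is not needed, and per-batch upsert logic handles the key maintenance.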