You could try this as a blueprint: read the data in through Spark Streaming, iterate over the stream, and convert each RDD into a DataFrame. Use these DataFrames to perform whatever processing is required, then save the resulting DataFrame into your target relational warehouse.
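
A rough PySpark sketch of that blueprint follows. Treat it as a starting point rather than a tested Greenplum loader: the socket source, host names, table name, column layout, credentials, and the helper names (parse_line, jdbc_properties, save_batch) are all placeholders made up for illustration, and it assumes the stock PostgreSQL JDBC driver is on the classpath and works against your warehouse.

```python
from typing import Dict, Tuple

# Hypothetical connection details, for illustration only.
JDBC_URL = "jdbc:postgresql://gp-master:5432/warehouse"
TARGET_TABLE = "staging.events"


def jdbc_properties(user: str, password: str) -> Dict[str, str]:
    """Build the connection properties handed to DataFrameWriter.jdbc."""
    return {"user": user, "password": password, "driver": "org.postgresql.Driver"}


def parse_line(line: str) -> Tuple[int, str, float]:
    """Turn one 'id,name,amount' text record into a typed tuple (assumed layout)."""
    event_id, name, amount = line.split(",")
    return int(event_id), name.strip(), float(amount)


def save_batch(time, rdd):
    """Runs once per micro-batch: RDD -> DataFrame -> processing -> JDBC append."""
    if rdd.isEmpty():
        return
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(rdd.map(parse_line), ["event_id", "name", "amount"])
    # Placeholder processing: drop duplicate ids and negative amounts.
    cleaned = df.dropDuplicates(["event_id"]).filter("amount >= 0")
    cleaned.write.jdbc(
        JDBC_URL,
        TARGET_TABLE,
        mode="append",
        properties=jdbc_properties("loader", "change-me"),
    )


def main():
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="warehouse-loader")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches
    # A socket source stands in for Kafka or whatever feeds the stream.
    lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD(save_batch)
    ssc.start()
    ssc.awaitTermination()
```

Calling main() from a script submitted with spark-submit starts the stream. Note that the JDBC writer only appends rows here; key lookups or MERGE-style upserts would have to happen in the processing step or inside the warehouse after the rows land in a staging table.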

HTH

Tariq, Mohammad
about.me/mti <http://about.me/mti>

On Wed, Mar 1, 2017 at 12:27 AM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hi Adaryl,
>
> You could definitely load data into a warehouse through Spark's JDBC
> support through DataFrames. Could you please explain your use case a bit
> more? That'll help us in answering your query better.
>
> On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <
> adaryl.wakefi...@hotmail.com> wrote:
>
>> I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would,
>> of course, have to be in any architecture, but it looks like they are
>> suggesting that Kafka is all you need.
>>
>> My primary concern is the complexity of loading warehouses. I have a web
>> development background, so I have somewhat of an idea of how to insert data
>> into a database from an application. I’ve since moved on to straight
>> database programming and don’t work with anything that reads from an app
>> anymore.
>>
>> Loading a warehouse requires a lot of cleaning of data and running and
>> grabbing keys to maintain referential integrity. Usually that’s done in a
>> batch process. Now I have to do it record by record (or a few records at a
>> time). I have some ideas but I’m not quite there yet.
>>
>> I thought SparkSQL would be the way to get this done, but so far all the
>> examples I’ve seen are just SELECT statements, no INSERT or MERGE
>> statements.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics, LLC
>> 913.938.6685
>> www.massstreet.net
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>> *From:* Femi Anthony [mailto:femib...@gmail.com]
>> *Sent:* Tuesday, February 28, 2017 4:13 AM
>> *To:* Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: using spark to load a data warehouse in real time
>>
>> Have you checked to see if there are any drivers that enable you to write
>> to Greenplum directly from Spark?
>>
>> You can also take a look at this link:
>>
>> https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q
>>
>> Apparently GPDB is based on Postgres, so that approach may work.
>>
>> Another approach may be for Spark Streaming to write to Kafka, and then
>> have another process read from Kafka and write to Greenplum.
>>
>> Kafka Connect may be useful in this case -
>>
>> https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
>>
>> Femi Anthony
>>
>> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <
>> adaryl.wakefi...@hotmail.com> wrote:
>>
>> Is anybody using Spark Streaming/SQL to load a relational data warehouse
>> in real time? There isn’t a lot of information on this use case out there.
>> When I google real time data warehouse load, nothing I find is up to date.
>> It’s all turn of the century stuff and doesn’t take into account
>> advancements in database technology. Additionally, whenever I try to learn
>> Spark, it’s always the same thing: play with Twitter data, never structured
>> data. All the CEP use cases are about data science.
>>
>> I’d like to use Spark to load Greenplum in real time. Intuitively, this
>> should be possible. I was thinking Spark Streaming with Spark SQL along
>> with an ORM should do it. Am I off base with this? Is the reason there
>> are no examples that there is a better way to do what I want?