Re: using spark to load a data warehouse in real time

Femi Anthony Tue, 28 Feb 2017 02:12:50 -0800

Have you checked whether there are any drivers that let you write to
Greenplum directly from Spark?


You can also take a look at this link:

https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q

Apparently GPDB is based on Postgres, so that approach may work.
Another approach may be for Spark Streaming to write to Kafka, and then have
another process read from Kafka and write to Greenplum.

Kafka Connect may be useful in this case -

https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
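
For the Postgres-based route mentioned above, here is a minimal sketch of
what a write from Spark might look like. It assumes Greenplum accepts
connections from the stock PostgreSQL JDBC driver and that the driver jar is
on the Spark classpath; I have not verified this against a real cluster. The
helper names (greenplum_jdbc_options, append_batch) and the host, database,
and table values are made up for illustration.

```python
def greenplum_jdbc_options(host, port, database, table, user, password):
    """Build the options dict for Spark's built-in JDBC data source.

    Assumption: Greenplum is reachable through the standard PostgreSQL
    JDBC driver (org.postgresql.Driver).
    """
    return {
        "url": "jdbc:postgresql://{}:{}/{}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def append_batch(df, options):
    """Append one micro-batch (a Spark DataFrame) to the target table."""
    df.write.format("jdbc").options(**options).mode("append").save()


# Hypothetical connection details, for illustration only.
opts = greenplum_jdbc_options(
    "gp-master.example.com", 5432, "warehouse", "public.events",
    "loader", "secret",
)
print(opts["url"])
```

With Spark Streaming you would probably call append_batch from inside
foreachRDD, after converting each RDD to a DataFrame.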

Femi Anthony



> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> 
> wrote:
> 
> Is anybody using Spark streaming/SQL to load a relational data warehouse in 
> real time? There isn’t a lot of information on this use case out there. When 
> I google real time data warehouse load, nothing I find is up to date. It’s 
> all turn of the century stuff and doesn’t take into account advancements in 
> database technology. Additionally, whenever I try to learn Spark, it’s always
> the same thing: play with Twitter data, never structured data. All the CEP
> use cases are about data science.
>  
> I’d like to use Spark to load Greenplum in real time. Intuitively, this
> should be possible. I was thinking Spark Streaming with Spark SQL along with
> an ORM should do it. Am I off base with this? Is the reason there are no
> examples that there is a better way to do what I want?
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
