We did this all the time at my last position.

1. We had unstructured data in S3.

2. We read directly from S3 and gave the data structure with a Spark DataFrame.

3. We wrote the results back to S3.

4. We used Redshift's fast parallel load (COPY) to get the results into a table. A rough sketch of the whole flow is below.
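
Everything in this sketch is a placeholder (bucket names, paths, the parsing), and the final COPY runs inside Redshift, not Spark:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-to-redshift").getOrCreate()
import spark.implicits._

// 1. Unstructured data already sits in S3; read it as raw lines.
val raw = spark.read.text("s3a://my-bucket/raw/events/")

// 2. Give it structure with a DataFrame (tab-splitting is just an example).
val structured = raw
  .map(_.getString(0).split("\t"))
  .map(f => (f(0), f(1), f(2)))
  .toDF("event_time", "user_id", "payload")

// 3. Write the structured results back to S3.
structured.write.mode("overwrite").csv("s3a://my-bucket/staging/events/")

// 4. Load into Redshift with a parallel COPY, run from any SQL client:
//    COPY events FROM 's3://my-bucket/staging/events/'
//    IAM_ROLE 'arn:aws:iam::<account>:role/<load-role>' CSV;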

Henry


On 02/28/2017 11:04 AM, Mohammad Tariq wrote:
You could try this as a blueprint:

Read the data in through Spark Streaming, iterate over the stream, and convert each RDD into a DataFrame. Use these DataFrames to perform whatever processing is required, then save each one into your target relational warehouse.
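
A bare-bones sketch of that blueprint in Scala (the source, schema, JDBC URL, and table name below are all made-up placeholders):

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Event(id: String, amount: Double)

val spark = SparkSession.builder().appName("stream-to-warehouse").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

val props = new Properties()
props.setProperty("user", "etl_user")        // placeholder credentials
props.setProperty("password", "secret")

// A socket stream stands in for whatever actually feeds the stream.
val lines = ssc.socketTextStream("localhost", 9999)

// Convert each micro-batch RDD into a DataFrame, process it, and append
// the result to the warehouse table over JDBC.
lines.foreachRDD { rdd =>
  import spark.implicits._
  val df = rdd.map(_.split(","))
    .map(f => Event(f(0), f(1).toDouble))
    .toDF()

  val processed = df.filter($"amount" > 0)   // whatever processing is required

  processed.write
    .mode("append")
    .jdbc("jdbc:postgresql://warehouse:5432/dw", "fact_events", props)
}

ssc.start()
ssc.awaitTermination()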

HTH
--
                
Tariq, Mohammad
https://about.me/mti



On Wed, Mar 1, 2017 at 12:27 AM, Mohammad Tariq <donta...@gmail.com> wrote:

    Hi Adaryl,

    You could definitely load data into a warehouse through Spark's
    JDBC support via DataFrames. Could you please explain your use
    case a bit more? That'll help us answer your question better.


--
    Tariq, Mohammad
    https://about.me/mti

    


    On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield
    <adaryl.wakefi...@hotmail.com> wrote:

        I haven’t heard of Kafka Connect. I’ll have to look into it.
        Kafka would, of course, have to be in any architecture, but it
        looks like they are suggesting that Kafka is all you need.

        My primary concern is the complexity of loading warehouses. I
        have a web development background, so I have some idea of how
        to insert data into a database from an application. I’ve since
        moved on to straight database programming and don’t work with
        anything that reads from an app anymore.

        Loading a warehouse requires a lot of data cleaning, plus
        looking up and grabbing keys to maintain referential integrity.
        Usually that’s done in a batch process. Now I have to do it
        record by record (or a few records at a time). I have some
        ideas, but I’m not quite there yet.

        I thought Spark SQL would be the way to get this done, but so
        far all the examples I’ve seen are just SELECT statements, no
        INSERT or MERGE statements.
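
        What I’m picturing, though I haven’t proven it out, is something
        along these lines (every table, column, and connection name here
        is a made-up placeholder):

        import java.util.Properties
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("dw-load-idea").getOrCreate()
        import spark.implicits._

        val url = "jdbc:postgresql://warehouse:5432/dw"
        val props = new Properties()
        props.setProperty("user", "etl_user")   // placeholders
        props.setProperty("password", "secret")

        // Stand-in for the handful of records arriving in one micro-batch.
        val incoming = Seq(("cust-42", "ord-1001", 19.99))
          .toDF("customer_id", "order_id", "amount")

        // Grab surrogate keys from the dimension table to keep referential
        // integrity...
        val dimCustomer = spark.read.jdbc(url, "dim_customer", props)
          .select("customer_key", "customer_natural_id")

        // ...join them onto the incoming records...
        val fact = incoming
          .join(dimCustomer,
            incoming("customer_id") === dimCustomer("customer_natural_id"))
          .select("customer_key", "order_id", "amount")

        // ...and append to the fact table. The append is the INSERT; a true
        // MERGE/upsert would still have to happen inside the database.
        fact.write.mode("append").jdbc(url, "fact_orders", props)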

        Adaryl "Bob" Wakefield, MBA
        Principal
        Mass Street Analytics, LLC
        913.938.6685

        www.massstreet.net

        www.linkedin.com/in/bobwakefieldmba
        Twitter: @BobLovesData

        *From:* Femi Anthony [mailto:femib...@gmail.com]
        *Sent:* Tuesday, February 28, 2017 4:13 AM
        *To:* Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
        *Cc:* user@spark.apache.org
        *Subject:* Re: using spark to load a data warehouse in real time

        Have you checked to see if there are any drivers to enable you
        to write to Greenplum directly from Spark?

        You can also take a look at this link:

        
        https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q

        Apparently GPDB is based on Postgres, so that approach may
        work.
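
        If the stock Postgres JDBC driver really does work against
        Greenplum (I haven’t verified that), the write could look roughly
        like this, with the URL, table, and credentials as placeholders:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("spark-to-greenplum").getOrCreate()
        import spark.implicits._

        // Toy DataFrame standing in for whatever was actually computed.
        val resultDf = Seq(("2017-02-28", "page_view", 1L))
          .toDF("dt", "event_type", "cnt")

        // Write straight to Greenplum through the plain Postgres JDBC driver
        // (untested against GPDB; everything below is a placeholder).
        resultDf.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://greenplum-master:5432/analytics")
          .option("driver", "org.postgresql.Driver")
          .option("dbtable", "public.fact_events")
          .option("user", "etl_user")
          .option("password", "secret")
          .mode("append")
          .save()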

        Another approach may be for Spark Streaming to write to Kafka,
        and then have another process read from Kafka and write to
        Greenplum.
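
        A minimal sketch of the Spark-to-Kafka leg (broker address and
        topic name are placeholders); the Kafka-to-Greenplum side would
        then be a separate consumer or a Kafka Connect JDBC sink:

        import java.util.Properties
        import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val conf = new SparkConf().setAppName("stream-to-kafka")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Any DStream will do; a socket stream keeps the example small.
        val events = ssc.socketTextStream("localhost", 9999)

        events.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // One producer per partition, created on the executor.
            val props = new Properties()
            props.put("bootstrap.servers", "kafka-broker:9092")
            props.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer")
            props.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer")
            val producer = new KafkaProducer[String, String](props)

            records.foreach { r =>
              producer.send(new ProducerRecord[String, String]("warehouse-events", r))
            }
            producer.close()
          }
        }

        ssc.start()
        ssc.awaitTermination()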

        Kafka Connect may be useful in this case -

        
        https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/

        Femi Anthony


        On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield
        <adaryl.wakefi...@hotmail.com> wrote:

            Is anybody using Spark Streaming/SQL to load a relational
            data warehouse in real time? There isn’t a lot of
            information on this use case out there. When I google real
            time data warehouse load, nothing I find is up to date.
            It’s all turn-of-the-century stuff and doesn’t take into
            account advancements in database technology. Additionally,
            whenever I try to learn Spark, it’s always the same thing:
            play with Twitter data, never structured data. All the CEP
            use cases are about data science.

            I’d like to use Spark to load Greenplum in real time.
            Intuitively, this should be possible. I was thinking Spark
            Streaming with Spark SQL along with an ORM should do it. Am
            I off base with this? Is the reason there are no examples
            that there is a better way to do what I want?

            Adaryl "Bob" Wakefield, MBA
            Principal
            Mass Street Analytics, LLC
            913.938.6685

            www.massstreet.net

            www.linkedin.com/in/bobwakefieldmba
            Twitter: @BobLovesData




--
Henry Tremblay
Robert Half Technology
