That does, thanks. I'm starting to think a straight Kafka solution would be more appropriate.
Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685 www.massstreet.net<http://www.massstreet.net> www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba> Twitter: @BobLovesData From: Sam Elamin [mailto:hussam.ela...@gmail.com] Sent: Wednesday, March 1, 2017 2:29 AM To: Adaryl Wakefield <adaryl.wakefi...@hotmail.com>; Jörn Franke <jornfra...@gmail.com> Cc: user@spark.apache.org Subject: Re: using spark to load a data warehouse in real time Hi Adaryl Having come from a Web background myself I completely understand your confusion so let me try to clarify a few things First and foremost, Spark is a data processing engine not a general framework. In the Web applications and frameworks world you load the entities, map them to the UI and serve them up to the users then save whatever you need to back to the database via some sort of entity mapping. Whether that's an orm or a stored procedures or any other manner Spark as I mentioned is a data processing engine so there Is no concept of an orm or data mapper. You can give it the schema of what you expect the data to like like, it also works well with most of the data formats being used in the industry like CSV,JSON,AVRO and PARQUET including infering the schema from the data provided making it much easier to develop and maintain Now as to your question of loading data in real time it absolutely can be done. Traditionally data coming in arrives at a location most people call the landing. This is where the extract of the etl part begins. As Jorn mention spark streaming isn't meant to write to a database but you can write to kafka or kinesis to write to a pipeline then have another process call them and write to your end datastore. The creators of spark realised that you're use case is absolutely valid and almost everyone they talked to said that streaming on its own wasn't enough, for this very same reason the concept of structured streaming was brought in place. Se this blog post from databricks https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html You can potentially use the structured streaming APIs to continually read changes from hdfs or in your case S3 then write it out via jdbc to your end datastore I have done it before so I'll give you a few gotchas to be aware of The most important one is that your end datastore or data warehouse supports streaming inserts, some are better than others. Redshift specifically is really bad when it comes to small very frequent deltas which is what streaming at high scale is The second is that the structured streaming is still in alpha phase and the code is marked as experimental, that's not to say it will die the minute you push any load through because I found that it handled Gbs of data well. The pains I found is that the underlying goal of structured streaming was to use the underlying dataframe APIs hence unifying the batch and stream data types meaning you only need to learn one. However some methods don't yet work on the streaming dataframes such as dropDuplicates That's pretty much it. So really it comes down to you're use case, if you need the data to be reliable and never go down then implement kafka or Kinesis. 
I have done it before, so I'll give you a few gotchas to be aware of.

The most important one is that your end datastore or data warehouse must support streaming inserts; some handle them better than others. Redshift specifically is really bad when it comes to small, very frequent deltas, which is exactly what streaming at high scale produces.

The second is that Structured Streaming is still in alpha and the code is marked as experimental. That's not to say it will die the minute you push any load through it; I found it handled GBs of data well. The pain I found is that the underlying goal of Structured Streaming was to build on the DataFrame APIs, unifying the batch and streaming data types so you only need to learn one. However, some methods don't yet work on streaming DataFrames, such as dropDuplicates.

That's pretty much it. So really it comes down to your use case: if you need the data to be reliable and never go down, then implement Kafka or Kinesis. If it's a proof of concept or you are trying to validate a theory, use Structured Streaming, as it's much quicker to get going: weeks or months of setup versus a few hours.

I hope I clarified things for you.

Regards
Sam

Sent from my iPhone

On Wed, 1 Mar 2017 at 07:34, Jörn Franke <jornfra...@gmail.com> wrote:

I am not sure that Spark Streaming is what you want. It is for streaming analytics, not for loading into a DWH. You also need to define what real time means and what is needed there; it will differ from client to client significantly.

From my experience, SQL alone will not be enough for users in the future. Large data volumes in particular require much more than just aggregations, which become less useful at that scale. Users will have to learn new ways of dealing with the data from a business perspective: proper sampling from large datasets, machine learning approaches, etc. These new methods are business driven, not technically driven. I think it is wrong to assume that users learning new skills is a bad thing; it may become a necessity.

On 28 Feb 2017, at 23:18, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:

I'm actually trying to come up with a generalized use case that I can take from client to client. We have structured data coming from some application. Instead of dropping it into Hadoop and then using yet another technology to query that data, I just want to dump it into a relational MPP DW so nobody has to learn new skills or new tech just to do some analysis. Everybody and their mom can write SQL. Designing relational databases is a rare skill, but not as rare as what is necessary for designing some NoSQL solutions. I'm looking for the fastest path to move a company from batch to real-time analytical processing.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Mohammad Tariq [mailto:donta...@gmail.com]
Sent: Tuesday, February 28, 2017 12:57 PM
To: Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time

Hi Adaryl,

You could definitely load data into a warehouse through Spark's JDBC support for DataFrames. Could you please explain your use case a bit more? That'll help us answer your query better.

Tariq, Mohammad
about.me/mti
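The DataFrame JDBC write Mohammad refers to is essentially a one-liner in batch mode. A minimal sketch, with placeholder connection details, paths, and table names (Greenplum speaks the Postgres wire protocol, so the stock PostgreSQL JDBC driver on the classpath is normally enough):

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.appName("jdbc-load").getOrCreate()

// In batch mode Spark happily infers the schema from the files themselves.
val customers = spark.read.json("s3a://my-bucket/landing/customers/") // hypothetical path

val props = new Properties()
props.setProperty("user", "loader") // placeholder credentials
props.setProperty("password", sys.env("DWH_PASSWORD"))

// Append the whole DataFrame to the warehouse staging table over JDBC.
customers.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://dwh-host:5432/dwh", "staging.customers", props)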
On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:

I haven't heard of Kafka Connect. I'll have to look into it. Kafka would, of course, have to be in any architecture, but it looks like they are suggesting that Kafka is all you need.

My primary concern is the complexity of loading warehouses. I have a web development background, so I have somewhat of an idea of how to insert data into a database from an application. I've since moved on to straight database programming and don't work with anything that reads from an app anymore.

Loading a warehouse requires a lot of cleaning of data and looking up keys to maintain referential integrity. Usually that's done in a batch process. Now I have to do it record by record (or a few records at a time). I have some ideas, but I'm not quite there yet.

I thought Spark SQL would be the way to get this done, but so far all the examples I've seen are just SELECT statements, no INSERT or MERGE statements.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Femi Anthony [mailto:femib...@gmail.com]
Sent: Tuesday, February 28, 2017 4:13 AM
To: Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time

Have you checked to see if there are any drivers that enable you to write to Greenplum directly from Spark?

You can also take a look at this link:
https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q

Apparently GPDB is based on Postgres, so that approach may work.

Another approach may be for Spark Streaming to write to Kafka, and then have another process read from Kafka and write to Greenplum. Kafka Connect may be useful in this case:
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/

Femi Anthony

On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:

Is anybody using Spark Streaming/SQL to load a relational data warehouse in real time? There isn't a lot of information on this use case out there. When I google "real time data warehouse load", nothing I find is up to date; it's all turn-of-the-century stuff that doesn't take into account advancements in database technology. Additionally, whenever I try to learn Spark, it's always the same thing: playing with Twitter data, never structured data, and all the CEP use cases are about data science.

I'd like to use Spark to load Greenplum in real time. Intuitively, this should be possible. I was thinking Spark Streaming with Spark SQL along with an ORM should do it. Am I off base with this? Is the reason there are no examples that there is a better way to do what I want?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData
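To make the Spark-to-Kafka hand-off Femi suggests concrete, here is a minimal sketch using the Structured Streaming Kafka sink (available from roughly Spark 2.2 with the spark-sql-kafka-0-10 package; the broker address, paths, and topic name are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.types._

// Requires the spark-sql-kafka-0-10 package on the classpath, e.g.
//   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version>
val spark = SparkSession.builder.appName("stream-to-kafka").getOrCreate()

val schema = new StructType()
  .add("event_id", StringType)
  .add("payload", StringType)

val events = spark.readStream
  .schema(schema)
  .json("s3a://my-bucket/landing/events/") // hypothetical landing path

// The Kafka sink expects a string or binary "value" column, so pack
// each row into a single JSON string.
val query = events
  .select(to_json(struct(events.columns.map(col): _*)).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("topic", "dwh-staging")                    // hypothetical topic
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
  .start()

query.awaitTermination()

A downstream consumer, such as a Kafka Connect JDBC sink, can then drain the topic into Greenplum at whatever rate the warehouse tolerates.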