I’m actually planning on using the in-memory database MemSQL. Creating a file 
then ingesting it seems like we’re back to batch processing. I know the 
definition of real time varies and any improvement over 24 hours is a good 
thing, but I’d like to get as close to the actual event happening as possible. 

I’ve been studying Storm, Samza, and Spark Streaming. The literature says that 
Storm is good for ETL, but I’ve also read that the Trident abstraction has a 
large negative impact on throughput. 

So MemSQL boasts rapid ingestion. Back to my original question: is the method 
for loading data really just a run-of-the-mill INSERT statement? No other magic 
used than that?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Palmer, Cliff A. (NE) 
Sent: Saturday, March 07, 2015 10:44 AM
To: [email protected] 
Subject: RE: real time warehouse loads

Bob, if "real time" means "up to a few minutes is acceptable" then I'd 
recommend you use Storm to do any pre-load processing and write the result to a 
text/csv/etc file in a directory.  Then use a separate utility (most databases 
have something that does this) to load data from the files you create into the 
database.
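
As a rough illustration of the file hand-off described above, here is a minimal Python sketch of the writing side. The function name `write_batch`, the `batch_NNNNNN.csv` naming scheme, and the spool directory are all hypothetical, not anything Storm or a particular database provides. The one design point it shows is writing under a temporary name and renaming at the end, so a loader polling the directory never picks up a half-written file.

```python
import csv
import os
import tempfile


def write_batch(records, spool_dir, batch_id):
    """Write one batch of processed rows to spool_dir as a CSV file.

    The file is written under a temporary name and renamed at the end,
    so a loader watching the directory only ever sees complete files.
    """
    os.makedirs(spool_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, f"batch_{batch_id:06d}.csv")

    # Create the temp file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as handle:
        csv.writer(handle).writerows(records)

    os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    return final_path
```

The database's own bulk-load utility would then be pointed at the finished `.csv` files; which utility that is depends on the database.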

This sounds slower, but remember that establishing a connection to a database 
to run a SQL INSERT has noticeable latency.  It's also true that each connection 
(usually) takes a port/socket, memory and is often a separate OS task so you 
are consuming resources that you would probably want Storm using.
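
If you do go the INSERT route, the connection cost above is usually paid once rather than per row: open one connection, keep it, and send rows in batches. A minimal Python sketch of that idea follows. It uses the standard-library `sqlite3` module purely as a stand-in for the real warehouse, and the `events` table and `load_batch` helper are made up for illustration.

```python
import sqlite3


def load_batch(conn, rows):
    """Insert many rows over one already-open connection in one transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)", rows
        )


# Stand-in for the warehouse connection; opened once and reused.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

load_batch(conn, [(1, "click"), (2, "view"), (3, "purchase")])
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The batch size becomes the knob between latency and overhead: smaller batches land sooner, larger ones spread the per-statement cost over more rows.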

There are other solutions for something closer to real time, but they require 
an in-memory database or "fun with caching" which will require specialized 
expertise.

HTH




--------------------------------------------------------------------------------

From: Adaryl "Bob" Wakefield, MBA [[email protected]]
Sent: Friday, March 06, 2015 7:54 PM
To: [email protected]
Subject: real time warehouse loads


I’m looking at Storm as a method to load data warehouses in real time. I am not 
that familiar with Java. I’m curious about the actual mechanism to load records 
into tables. Is it just a matter of feeding the final result of processing into 
an INSERT INTO SQL statement or is it more complicated than that? It seems to me 
that hammering the database with SQL statements of real time data is a bit 
inefficient.  

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData
