What you've read about Trident's throughput is wrong. Of course it depends on 
what you actually do in your topology (it's possible to shoot yourself in the 
foot and kill performance with both the core and Trident APIs), but Trident can 
achieve nearly twice the throughput of a core-API topology, at the expense of 
additional latency.

In a simple benchmark I ran on EC2 with 3 supervisor nodes (m1.large -- not a 
huge amount of horsepower), the core Storm topology was able to process about 
150k tuples per second with 80 ms latency, while Trident processed about 300k 
tuples per second with about 250 ms latency.
(The topologies were tuned to balance throughput and latency.)

Trident performance isn't any worse than the core API's; it's just different, 
because Trident processes data in micro-batches.
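
As a back-of-the-envelope illustration of why that is (plain Python, not Storm 
code; the overhead and per-tuple costs below are invented for the example), 
batching amortizes fixed per-call overhead, which raises throughput while each 
tuple waits longer for its batch to complete:

```python
# Toy model: every call carries a fixed overhead (acking, bookkeeping,
# network round trips) plus a small per-tuple processing cost.
# Micro-batching spreads the fixed overhead across many tuples.
OVERHEAD_MS = 1.0      # fixed cost per call (made-up number)
PER_TUPLE_MS = 0.005   # cost per tuple (made-up number)

def throughput_and_latency(batch_size):
    batch_time_ms = OVERHEAD_MS + PER_TUPLE_MS * batch_size
    tuples_per_sec = batch_size / (batch_time_ms / 1000.0)
    # a tuple can wait as long as it takes its whole batch to finish
    worst_case_latency_ms = batch_time_ms
    return tuples_per_sec, worst_case_latency_ms

core_tput, core_lat = throughput_and_latency(1)        # tuple-at-a-time
trident_tput, trident_lat = throughput_and_latency(500)  # micro-batch

# Bigger batches: higher throughput, higher latency.
assert trident_tput > core_tput
assert trident_lat > core_lat
```

The actual numbers are meaningless; the shape of the trade-off is the point.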

-Taylor
> On Mar 7, 2015, at 3:41 PM, Adaryl Bob Wakefield, MBA 
> <adaryl.wakefi...@hotmail.com> wrote:
> 
> I’m actually planning on using the in-memory database MemSQL. Creating a file 
> then ingesting it seems like we’re back to batch processing. I know the 
> definition of real time varies and any improvement over 24 hours is a good 
> thing, but I’d like to get as close to the actual event happening as possible.
>  
> I’ve been studying Storm, Samza, and Spark Streaming. The literature says 
> that Storm is good for ETL but I’ve also read that the trident abstraction 
> has a large negative impact on throughput.
>  
> So MemSQL boasts rapid ingestion. Back to my original question. The method for 
> loading data really is just a run-of-the-mill INSERT statement? No other 
> magic used than that?
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> From: Palmer, Cliff A. (NE)
> Sent: Saturday, March 07, 2015 10:44 AM
> To: user@storm.apache.org
> Subject: RE: real time warehouse loads
>  
> Bob, if "real time" means "up to a few minutes is acceptable" then I'd 
> recommend you use Storm to do any pre-load processing and write the result to 
> a text/CSV/etc. file in a directory.  Then use a separate utility (most 
> databases have something that does this) to load data from the files you 
> create into the database.
> This sounds slower, but remember that establishing a connection to a database 
> to run a SQL INSERT has noticeable latency.  It's also true that each 
> connection (usually) takes a port/socket and memory, and is often a separate 
> OS task, so you are consuming resources that you would probably rather have 
> Storm using.
> There are other solutions for something closer to real time, but they require 
> an in-memory database or "fun with caching" which will require specialized 
> expertise.
> HTH
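
Cliff's file-then-bulk-load workflow above could be sketched like this (Python 
stdlib only; the rows, file name, and table name are invented, and the exact 
load command depends on your database's bulk loader):

```python
import csv
import os
import tempfile

# Hypothetical processed tuples coming out of a Storm bolt.
rows = [("2015-03-07T10:00:00", "click", 42),
        ("2015-03-07T10:00:01", "view", 7)]

# Write one batch of results to a CSV file in a drop directory.
path = os.path.join(tempfile.mkdtemp(), "events_batch_0001.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# A separate loader process then picks the file up, e.g. (MySQL syntax):
#   LOAD DATA INFILE '.../events_batch_0001.csv' INTO TABLE events
#   FIELDS TERMINATED BY ',';
```

Keeping the bulk loader out-of-process is the whole point: the topology just 
writes files and moves on.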
>  
> From: Adaryl "Bob" Wakefield, MBA [adaryl.wakefi...@hotmail.com]
> Sent: Friday, March 06, 2015 7:54 PM
> To: user@storm.apache.org
> Subject: real time warehouse loads
> 
> I’m looking at Storm as a method to load data warehouses in real time. I am 
> not that familiar with Java. I’m curious about the actual mechanism to load 
> records into tables. Is it just a matter of feeding the final result of 
> processing into an INSERT INTO SQL statement, or is it more complicated than 
> that? It seems to me that hammering the database with SQL statements of 
> real-time data is a bit inefficient.
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
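
For what it's worth, the mechanism asked about above really is just an INSERT. 
A minimal sketch, with Python's built-in sqlite3 standing in for the warehouse 
(the table and rows are invented; a real warehouse would be reached through its 
own driver, but the statement is the same), batching rows per transaction to 
cut round-trip and commit overhead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.execute("CREATE TABLE events (ts TEXT, kind TEXT, n INTEGER)")

# A batch of processed tuples, e.g. accumulated by a bolt.
batch = [("2015-03-07T10:00:00", "click", 42),
         ("2015-03-07T10:00:01", "view", 7)]

# One INSERT per tuple works, but committing many rows per transaction
# amortizes the per-statement and per-commit cost Cliff mentions.
with conn:  # one transaction for the whole batch
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
assert count == 2
```

No other magic than that: the efficiency questions are about how you batch and 
pool connections, not about the statement itself.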
