Re: Store previous calculated result

craig . charleton Mon, 09 Nov 2015 07:14:25 -0800

Stephen,

I originally looked at using the storm-jdbc external component but very quickly 
realized that it is only available in Storm 0.10.x.  So, I looked at the 
source code for storm-jdbc, and it discussed using 
http://brettwooldridge.github.io/HikariCP/ as a high-performance connection 
pool for MySQL.  I have used JPA with Hibernate and EclipseLink before, but I 
thought I would give HikariCP a try.  So far, it works really well.  However, I 
haven't deployed it into production yet.

JPA allows you to work with POJOs as entities and can be easier to code.  
However, I wanted to avoid any potential serialization issues that might arise 
in my system because I am already doing POJO->Avro->Kafka->Avro->Kryo->Storm.   

The way I am interacting with all of it together (I would share the code, but 
it isn't share-ready yet) is in a kind of pseudo-stateless manner.  I don't 
know if this will make sense, but here goes:

I assume that a bolt will not be able to keep its in-memory state once it has 
performed a calculation.  So, I persist values at the points where I would ack 
a tuple in Storm.  Ultimately, my tuples come from Kafka topics.  Therefore, if 
I don't ack a tuple, it will get replayed in the case of a failure.  In some 
places I have a bolt output its result to a Kafka topic as well as write it 
to MySQL.  This allows me to break up a big topology into smaller 
topologies that have different performance needs, calculation frequencies, and 
characteristics without losing the speed and scalability.  (Think inbound data 
cleaning, filtering, and transformation versus complex event processing.)
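
A minimal sketch of the persist-before-ack idea described above.  It is not 
the actual project code.  It assumes an in-memory SQLite database as a 
stand-in for MySQL, and the table names, the tuple fields, and the 
apply_tuple/read_total helpers are all hypothetical.  The Kafka 
topic/partition/offset is recorded in the same transaction as the value, so a 
replayed tuple is detected and not applied twice.

```python
import sqlite3


def open_store():
    """Create the hypothetical schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE totals (name TEXT PRIMARY KEY, total REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE applied ("
        " topic TEXT, kafka_partition INTEGER, kafka_offset INTEGER,"
        " PRIMARY KEY (topic, kafka_partition, kafka_offset))"
    )
    conn.commit()
    return conn


def apply_tuple(conn, tup):
    """Persist one tuple's effect atomically.

    Returns True if the tuple was applied and False if it was a replay that
    had already been applied; in both cases the caller may ack.  Any
    exception rolls the transaction back and propagates, so the caller
    should not ack and the tuple will be replayed.
    """
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied VALUES (?, ?, ?)",
            (tup["topic"], tup["partition"], tup["offset"]),
        )
        if cur.rowcount == 0:
            return False  # this offset was already recorded: a replay
        conn.execute(
            "INSERT OR IGNORE INTO totals (name, total) VALUES (?, 0)",
            (tup["name"],),
        )
        conn.execute(
            "UPDATE totals SET total = total + ? WHERE name = ?",
            (tup["amount"], tup["name"]),
        )
    return True


def read_total(conn, name):
    """Return the persisted total for a name, or 0.0 if none exists."""
    row = conn.execute(
        "SELECT total FROM totals WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else 0.0
```

In a bolt, the call to apply_tuple would sit just before the ack: ack when it 
returns, and fail (or simply don't ack) the tuple when it raises.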

I am still working on elements of the whole solution.  However, it all adds up 
to this: Storm and Kafka are made for each other.  I am leveraging Kafka's 
speed, storage, scalability, etc. to help Storm when something goes wrong.  
Storm is awesome, but it was built for speed and scalability (which is 
super-awesome), not for retaining state.  I just have to remind myself to use 
it for what I really need, which is to spread many processes across many 
commodity servers. 

Craig Charleton
[email protected]

> On Nov 9, 2015, at 8:36 AM, Stephen Powis <[email protected]> wrote:
> 
> Hey Craig,
> 
> Just out of curiosity, how are you interacting with mysql?  Via
> hibernate or something else?
> 
> Thanks!
> 
> 
>> On Mon, Nov 9, 2015 at 9:32 PM,  <[email protected]> wrote:
>> I have been working on a project that requires a lot of calculation and
>> retention of values in the bolts and here are some questions/considerations
>> that I think will help you:
>> 
>> - You should read this if you haven't already.  I must admit I had to read
>> through it many times before I got the concept:
>> http://storm.apache.org/documentation/Trident-state.html
>> 
>> - When a bolt goes down, Storm will recover it automatically.  Any of the
>> in-memory values that have been calculated will be lost unless you persist
>> the state using Trident.
>> 
>> - When persisting the state in Trident (saving it somewhere so Storm can
>> reconstitute the values when restarting the Bolt) you have to decide how
>> accurate the values calculated by the bolt need to be.  This point is not
>> discussed in the information that I found on Storm/Trident.  Without writing
>> thousands of words, my project required that the values calculated in a
>> Trident Bolt never be incorrect (complex financial calculations).  So I had
>> to make sure that when Storm loads the Trident state from a persistent
>> store to recover a Bolt, the values it uses come from an ACID-compliant
>> store.  Therefore, I couldn't use Cassandra or any other non-ACID-compliant
>> persistent storage because of the risk (however large or small) of the
>> values stored in Cassandra not being completely accurate.  After a lot of
>> analysis and lost sleep, I decided to use MySQL to persist the in-process
>> state of any Bolts.  There are some other persistence solutions that will
>> scale better than MySQL.  However, MySQL is still in use in huge
>> implementations and I estimated that I don't need a solution that can
>> process a million events a second but rather one that will process thousands
>> of events a second and make sure that, during start-up and recovery, the
>> values it uses reflect all the changes to the data.  There are some other
>> persistence solutions that are ACID-compliant and say they can process
>> faster than MySQL.  MemSQL and VoltDB looked promising.  However, they are
>> nowhere near as mature as MySQL and I have a lot of MySQL experience.
>> 
>> I would include more links to articles and git repos but I have to take my
>> child to school :-)
>> 
>> 
>> 
>> Craig Charleton
>> [email protected]
>> 
>> 
>> On Nov 7, 2015, at 6:27 AM, Miguel Ángel Fernández Fernández
>> <[email protected]> wrote:
>> 
>> In a Trident scenario, a real-time operation needs to know the previously
>> calculated result.
>> 
>> My current solution is very poor and probably incorrect (a hashmap in
>> bolts). Now I'm thinking of incorporating a cache (redis, memcached ...)
>> 
>> However, I suppose that there is a standard solution for this problem in
>> Trident (maybe a special state).
>> 
>> What do you think is the best approach?
>> 
>> Thanks for your time
