The I/O with Redis would concern me; admittedly I don't have much experience with it. Can't you write into your destination store directly as the messages come in? Do you need some kind of transactional support on the destination too? What are you writing into?

In cases where you need transactional support on the destination, what I've done (this is with SQL Server) is write to a temporary table and then, when I'm done, insert from there into the final destination, all in one transaction. This eliminates some of the I/O issues, since the data is at least in the right store already.
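As a minimal sketch of that staging-table pattern with plain JDBC: rows are appended to a staging table as they arrive, then promoted to the final table in one transaction. The connection URL, table names, and columns below are hypothetical, not anyone's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

public class StagingTableLoader {

    // Append incoming rows to the staging table as they arrive. These
    // small commits are cheap because nothing reads the staging table yet.
    static void stageBatch(Connection conn, List<String> rows) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO orders_staging (payload) VALUES (?)")) {
            for (String row : rows) {
                ps.setString(1, row);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    // Once every message for the transaction has been staged, promote the
    // whole set to the final table in a single database transaction, so
    // readers never see a partially loaded batch.
    static void promote(Connection conn) throws Exception {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(
                "INSERT INTO orders (payload) SELECT payload FROM orders_staging");
            st.executeUpdate("DELETE FROM orders_staging");
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details, for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=etl", "user", "pass")) {
            stageBatch(conn, Arrays.asList("row-1", "row-2"));
            promote(conn);
        }
    }
}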
On Thu, Apr 14, 2016 at 11:34 PM, pradeep s <[email protected]> wrote:

> Thanks John for the details on the Storm design. I am planning the
> approach below for our design.
> Keep the messages in Redis and increment the message count. Have a key
> for the total count, which comes from the last message.
> That last message can arrive at any time, since ordering is not
> guaranteed.
> Each time you receive a message, check whether the total count matches
> the received-message count. If it matches, push the messages to the
> database.
> Since we have 1 million row updates in a single transaction, we are
> planning to compress the message DTOs while putting them in the cache.
> When the counts match, get the list of DTOs, take sublists, and insert
> them into the database. Call commit after the last sublist is processed.
> Do you see any issues with this approach?
>
> On Thu, Apr 14, 2016 at 8:28 PM, John Bush <[email protected]> wrote:
>
>> We follow a pattern like this, to batch them in memory:
>> http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/
>>
>> The biggest gotcha I just ran into was that the collector isn't thread
>> safe, so make sure you don't ever ack/fail in another thread. I can tell
>> you how I got around that if you hit that problem; just read my blog:
>> https://scalalala.wordpress.com/2016/04/09/async-stormy-weather/
>>
>> The biggest table we've copied doing this was about 10 million rows,
>> with 100 or so columns. It works great. I am actually pulling from 3
>> different data sources this way (SQL Server, PostgreSQL, and Mongo) and
>> then feeding into Cassandra and Elasticsearch.
>>
>> We use a thread-safe counter, AtomicCounter in Guava; there is also
>> AtomicInteger in core Java, I'm not really sure of the difference. We
>> have a UI that gives visibility into where things are with the loading,
>> so now we flush the counts out to Cassandra as well to serve the UI, but
>> really completion is detected using the AtomicCounter. Experiment with
>> the fetch size on the JDBC read and the batch size on the write; it
>> probably depends on your specific environment. The memory overhead is
>> super low for us because we keep our batch size on writes fairly small,
>> since Cassandra is async on the writes anyway; if you are going into an
>> RDBMS, that would be different.
>>
>> On Thu, Apr 14, 2016 at 6:09 PM, pradeep s <[email protected]>
>> wrote:
>>
>>> Thanks John. Ours is a very similar use case. We thought of using a
>>> Redis cache as the storage layer, but since we can get bulk updates
>>> from the source, the cache value size can become pretty large. To get
>>> the total count of messages in a transaction, we keep a counter against
>>> a transaction-id key, and this count is sent as part of the last
>>> message in the transaction.
>>> There is an islast flag available from GoldenGate, but the messages can
>>> arrive in any order in Storm, so we are using the count that comes in
>>> the last message from GoldenGate.
>>> In the database bolt we can continue to hold the messages in a store
>>> till we get the total count.
>>> The main challenge is maintaining and updating the count under
>>> multithreaded bolt processing, and sometimes storing a GB worth of
>>> messages in the case of bulk updates. Any suggestions?
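A sketch of the Redis buffering scheme described above, including a guard for the multithreaded-count concern. The key names, the Jedis client, and the flush step are illustrative assumptions, not the poster's actual code; a real topology would also use a JedisPool rather than one shared Jedis instance.

import java.util.List;
import redis.clients.jedis.Jedis;

public class TxnBuffer {
    private final Jedis jedis = new Jedis("localhost");

    // Called once per CDC message; totalIfLast is non-null only on the
    // message that carries the transaction's total count.
    public void onMessage(String txnId, String compressedDto, Long totalIfLast) {
        String listKey = "txn:" + txnId + ":msgs";
        String countKey = "txn:" + txnId + ":count";
        String totalKey = "txn:" + txnId + ":total";

        jedis.rpush(listKey, compressedDto);
        jedis.incr(countKey); // INCR is atomic, safe across bolt executors

        if (totalIfLast != null) {
            jedis.set(totalKey, String.valueOf(totalIfLast));
        }

        // Re-read both keys so that whichever message arrives last (even if
        // the count-carrying message wasn't the last to arrive) sees the
        // match; SETNX ensures only one executor performs the flush.
        String total = jedis.get(totalKey);
        String count = jedis.get(countKey);
        if (total != null && total.equals(count)
                && jedis.setnx("txn:" + txnId + ":flushed", "1") == 1) {
            flushInSublists(jedis.lrange(listKey, 0, -1));
            jedis.del(listKey, countKey, totalKey);
        }
    }

    private void flushInSublists(List<String> dtos) {
        // Decompress the DTOs, insert sublists into the database, and
        // commit after the last sublist, as described in the message above.
    }
}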
>>> On Apr 14, 2016 9:22 AM, "John Bush" <[email protected]> wrote:
>>>
>>>> We do something kinda similar. I think you will need another store to
>>>> keep track of these; not sure about Storm's distributed cache. We use
>>>> Cassandra, but you could use ZooKeeper or some other store. The issue
>>>> is that since rows come out of Storm in no guaranteed order, you
>>>> really don't know when you are done. You need to know when you are
>>>> complete in order to remove the messages from your source (or
>>>> otherwise update stuff over there).
>>>>
>>>> So what we do is keep track of how many rows are on the read side (in
>>>> some store). Then, as we process, we update on the write side how many
>>>> rows we wrote. Then, by checking this count (how many we updated vs.
>>>> how many we expected in total), we know when we are done. It sounds
>>>> like your situation might be more complicated than ours if you're
>>>> talking about rows from many different tables all inside the same
>>>> transaction, but in any event some type of pattern like this should
>>>> work.
>>>>
>>>> For perspective, we essentially ETL data out of hundreds of tables
>>>> like this into Cassandra, and it works quite well. You just need to be
>>>> super careful with the completion logic; there are many edge cases to
>>>> consider.
>>>>
>>>> On Thu, Apr 14, 2016 at 9:00 AM, Nikos R. Katsipoulakis
>>>> <[email protected]> wrote:
>>>> > Hello Sreekumar,
>>>> >
>>>> > Have you thought of using Storm's distributed cache? If not, that
>>>> > might be a way to cache messages before you push them to the target
>>>> > DB. Another way to do so is to create your own Bolt that
>>>> > periodically pushes messages into the database.
>>>> >
>>>> > I hope I helped.
>>>> >
>>>> > Cheers,
>>>> > Nikos
>>>> >
>>>> > On Thu, Apr 14, 2016 at 12:54 AM, pradeep s <
>>>> > [email protected]> wrote:
>>>> >>
>>>> >> Hi,
>>>> >> We are using Storm for processing CDC messages from Oracle
>>>> >> GoldenGate. The pipeline is as below:
>>>> >> Oracle GoldenGate-->Queue-->Storm-->Relational DB
>>>> >> We have a requirement to hold the messages for a transaction id
>>>> >> until all the messages for that transaction are available in Storm.
>>>> >> There can be scenarios like 1 million updates happening in one
>>>> >> transaction in the source Oracle system.
>>>> >> Can you please suggest the best approach for holding the messages
>>>> >> and then pushing them to the target DB only when all messages for a
>>>> >> transaction id are available in Storm?
>>>> >>
>>>> >> Regards,
>>>> >> Pradeep S
>>>> >
>>>> > --
>>>> > Nikos R. Katsipoulakis,
>>>> > Department of Computer Science
>>>> > University of Pittsburgh
>>>>
>>>> --
>>>> John Bush
>>>> Trax Technologies, Inc.
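John's completion check (rows written vs. rows expected, per transaction) can be sketched with plain JDK atomics. This assumes, as in his flow, that the expected count is recorded on the read side before any writes replay; the class and method names here are made up, and a real topology would keep these counts in an external store such as Cassandra or ZooKeeper so they survive worker restarts.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

public class CompletionTracker {
    private final ConcurrentMap<String, AtomicLong> written = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, Long> expected = new ConcurrentHashMap<>();

    // Record the total row count observed on the read side, before
    // any writes for this transaction are replayed.
    public void setExpected(String txnId, long total) {
        expected.put(txnId, total);
    }

    // Record one written row. Because incrementAndGet returns a unique
    // value to each caller, exactly one thread observes the final count
    // and gets true, so the "done" action fires once.
    public boolean recordWrite(String txnId) {
        long w = written.computeIfAbsent(txnId, k -> new AtomicLong())
                        .incrementAndGet();
        Long total = expected.get(txnId);
        return total != null && w == total;
    }
}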
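Nikos's "Bolt that periodically pushes messages" is commonly built with Storm's tick tuples. Below is a minimal sketch against the Storm 1.x API; the batch-size threshold, flush body, and class name are placeholders, not a definitive implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;

public class PeriodicFlushBolt extends BaseRichBolt {
    private OutputCollector collector;
    private final List<Tuple> buffer = new ArrayList<>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        if (TupleUtils.isTick(tuple)) {
            flush();                      // time-based flush of partial batches
            collector.ack(tuple);
        } else {
            buffer.add(tuple);
            if (buffer.size() >= 1000) {  // size-based flush; tune the threshold
                flush();
            }
        }
    }

    private void flush() {
        // Write the buffered rows to the database in one batch here, and
        // ack only after the write succeeds, so replays cover failures.
        for (Tuple t : buffer) {
            collector.ack(t);
        }
        buffer.clear();
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every 5 seconds.
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
        return conf;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output stream
    }
}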
