Thanks John. Ours is a very similar use case. We thought of using Redis as the storage layer, but since we can get bulk updates from the source, the cached value size can become pretty large. To get the total count of messages in a transaction, we keep a counter against the transaction-ID key, and this count is sent as part of the last message in the transaction. GoldenGate does provide an is-last flag, but messages can arrive in any order in Storm, so we rely on the count carried in GoldenGate's last message instead. In the database bolt we can continue to hold the messages in a store until we have received the total count. The main challenges are maintaining and updating the count across multithreaded bolt processing, and storing what can sometimes be a GB worth of messages in the case of bulk updates. Any suggestions?

On Apr 14, 2016 9:22 AM, "John Bush" <[email protected]> wrote:
> We do something kinda similar. I think you will need another store to
> keep track of these; not sure about Storm's distributed cache. We use
> Cassandra, but you could use ZooKeeper or some other store. The issue
> is that, since rows come out of Storm in no guaranteed order, you
> really don't know when you are done, and you need to know when you are
> complete in order to remove the messages from your source (or
> otherwise update things over there).
>
> So what we do is keep track of how many rows are on the read side (in
> some store). Then, as we process, we update on the write side how many
> rows we wrote. By checking this count -- how many we updated vs. how
> many we expected in total -- we know when we are done. It sounds like
> your situation might be more complicated than ours if you are talking
> about rows from many different tables all inside the same transaction,
> but in any event some pattern like this should work.
>
> For perspective, we essentially ETL data out of hundreds of tables like
> this into Cassandra, and it works quite well. You just need to be
> super careful with the completion logic; there are many edge cases to
> consider.
>
> On Thu, Apr 14, 2016 at 9:00 AM, Nikos R. Katsipoulakis
> <[email protected]> wrote:
> > Hello Sreekumar,
> >
> > Have you thought of using Storm's distributed cache? If not, that
> > might be a way to cache messages before you push them to the target
> > DB. Another way to do so is to create your own Bolt that periodically
> > pushes messages into the database.
> >
> > I hope I helped.
> >
> > Cheers,
> > Nikos
> >
> > On Thu, Apr 14, 2016 at 12:54 AM, pradeep s
> > <[email protected]> wrote:
> >> Hi,
> >> We are using Storm for processing CDC messages from Oracle
> >> GoldenGate. The pipeline is as below:
> >> Oracle GoldenGate --> Queue --> Storm --> Relational DB
> >> We have a requirement to hold the messages for a transaction ID
> >> until all the messages for that transaction are available in Storm.
> >> There can be scenarios like one million updates happening in one
> >> transaction in the source Oracle system.
> >> Can you please suggest the best approach for holding the messages
> >> and then pushing them to the target DB only when all messages for
> >> the transaction ID are available in Storm?
> >>
> >> Regards,
> >> Pradeep S
> >
> > --
> > Nikos R. Katsipoulakis,
> > Department of Computer Science
> > University of Pittsburgh
>
> --
> John Bush
> Trax Technologies, Inc.
> M: 480-227-2910
> TraxTech.Com
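[Editor's note] The count-based completion check John describes -- record how many rows a transaction is expected to contain, count rows as they are written, and declare the transaction done when the two match -- can be sketched as below. This is a minimal illustration, not a Storm or GoldenGate API: the in-memory maps stand in for the external store (Cassandra, ZooKeeper, or Redis) discussed in the thread, and all class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the completion-check pattern from the thread: expected count
// per transaction vs. rows actually written. Both operations report
// completion, so it does not matter in which order Storm delivers the
// "last" message relative to the row writes.
public class CompletionTracker {

    private final ConcurrentHashMap<String, Long> expected = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, AtomicLong> written = new ConcurrentHashMap<>();

    /** Called when the last message arrives carrying the total row count.
     *  Returns true if the transaction is already complete, i.e. the
     *  "last" message arrived after all the writes. */
    public boolean setExpected(String txnId, long total) {
        expected.put(txnId, total);
        AtomicLong w = written.get(txnId);
        return w != null && w.get() == total;
    }

    /** Called by any bolt thread after persisting one row; returns true
     *  exactly when this write completes the transaction. */
    public boolean recordWrite(String txnId) {
        long w = written.computeIfAbsent(txnId, k -> new AtomicLong()).incrementAndGet();
        Long total = expected.get(txnId);
        return total != null && w == total;
    }

    public static void main(String[] args) {
        CompletionTracker t = new CompletionTracker();
        t.setExpected("tx1", 3);
        System.out.println(t.recordWrite("tx1")); // false (1 of 3)
        System.out.println(t.recordWrite("tx1")); // false (2 of 3)
        System.out.println(t.recordWrite("tx1")); // true  (3 of 3: complete)
    }
}
```

In a real deployment the entries for a completed transaction should also be evicted (or given a TTL) so the tracker's memory stays bounded, and the edge cases John warns about -- replayed tuples, failed writes -- need explicit handling.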
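[Editor's note] On the multithreaded count-update challenge raised at the top of the thread: as long as each written row is recorded with a single atomic increment, concurrent bolt executors cannot lose updates, so no locking around the counter is needed. The demo below is a hypothetical stand-in using `AtomicLong`; against Redis the same property would come from a single atomic `INCRBY` on the transaction-ID key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Many threads bump the same per-transaction counter concurrently and the
// final value is exact: one atomic increment per row, no read-modify-write
// race. The map/AtomicLong pair stands in for an external atomic counter.
public class AtomicCountDemo {

    static long countWrites(int threads, int writesPerThread) throws InterruptedException {
        ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < writesPerThread; i++) {
                    // one atomic increment per written row; in production this
                    // would be a single INCRBY against the transaction-ID key
                    counts.computeIfAbsent("tx1", k -> new AtomicLong()).incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return counts.get("tx1").get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWrites(8, 1000)); // prints 8000: no lost updates
    }
}
```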
