We follow a pattern like this to batch the writes in memory:
http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/
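
In case it helps, here is roughly what that pattern boils down to: buffer
tuples in the bolt, flush when the buffer hits a size limit or when a tick
tuple arrives, and only ack after the flush. This is just a sketch against
the Storm 1.x API; BatchingBolt, BATCH_SIZE, and writeBatch() are made-up
names for illustration, not our actual code:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.utils.TupleUtils;

    public class BatchingBolt extends BaseRichBolt {
        private static final int BATCH_SIZE = 100;
        private OutputCollector collector;
        private List<Tuple> batch;

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
            this.batch = new ArrayList<>(BATCH_SIZE);
        }

        @Override
        public void execute(Tuple tuple) {
            // Tick tuples flush a partial batch so rows don't sit around forever.
            if (TupleUtils.isTick(tuple)) {
                flush();
                return;
            }
            batch.add(tuple);
            if (batch.size() >= BATCH_SIZE) {
                flush();
            }
        }

        private void flush() {
            if (batch.isEmpty()) return;
            writeBatch(batch);            // one bulk write instead of N single writes
            for (Tuple t : batch) {
                collector.ack(t);         // ack only after the batch is persisted
            }
            batch.clear();
        }

        private void writeBatch(List<Tuple> tuples) { /* bulk insert to your sink */ }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            // Ask Storm to send this bolt a tick tuple every 5 seconds.
            Map<String, Object> conf = new HashMap<>();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
            return conf;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }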

The biggest gotcha I just ran into was that the collector isn't thread safe,
so make sure you never ack/fail from another thread. I can tell you how I got
around that if you hit that problem; it's all written up on my blog:
https://scalalala.wordpress.com/2016/04/09/async-stormy-weather/
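
The short version of a workaround, as a sketch (Storm 1.x API again, names
are hypothetical, and this is one way to do it rather than necessarily what
the post describes): async write callbacks only enqueue finished tuples, and
the bolt's executor thread drains the queue and does the actual ack:

    import java.util.Map;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class AsyncWriteBolt extends BaseRichBolt {
        private OutputCollector collector;
        // Written by async callbacks, drained by the executor thread.
        private final ConcurrentLinkedQueue<Tuple> pendingAcks = new ConcurrentLinkedQueue<>();

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            // Runs on the executor thread (the only thread that should touch
            // the collector), so first ack any writes that finished earlier.
            Tuple done;
            while ((done = pendingAcks.poll()) != null) {
                collector.ack(done);
            }
            // The callback runs on a driver thread, so it only enqueues.
            writeAsync(tuple, () -> pendingAcks.add(tuple));
        }

        // Hypothetical async write; in real life this would be something like
        // a Cassandra ResultSetFuture with a completion listener attached.
        private void writeAsync(Tuple t, Runnable onSuccess) { /* ... */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

One catch with a sketch like this: acks only drain when the next tuple
arrives, so you'd also want to drain on tick tuples to avoid stranding the
last few.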

The biggest table we've copied doing this was about 10 million rows, with
100 or so columns.  It works great.  I'm actually pulling from three
different data sources this way (SQL Server, PostgreSQL, and Mongo) and
feeding into Cassandra and Elasticsearch.

We use a thread-safe counter, AtomicCounter from Guava (there is also
AtomicInteger in core Java; not really sure of the difference).  We have a UI
that gives visibility into where things are with the loading, so now we
flush the counts out to Cassandra as well to serve the UI, but completion
is really detected using the AtomicCounter.  Experiment with fetch size on
the JDBC read and batch size on the write; it probably depends on your
specific environment.  The memory overhead is super low for us, because we
keep our batch size on writes fairly small, since Cassandra is async on
writes anyway; if you are going into an RDBMS that would be different.
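
The counting piece itself is tiny; something like this (hypothetical names,
and it assumes you captured the expected row count on the read side before
the load starts):

    import java.util.concurrent.atomic.AtomicLong;

    public class LoadProgress {
        private final long expectedRows;          // counted up front on the read side
        private final AtomicLong writtenRows = new AtomicLong();

        public LoadProgress(long expectedRows) {
            this.expectedRows = expectedRows;
        }

        // Called from the write bolt after each batch is persisted.
        // Returns true exactly once, when the final batch lands.
        public boolean recordBatch(int batchSize) {
            return writtenRows.addAndGet(batchSize) == expectedRows;
        }
    }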

On Thu, Apr 14, 2016 at 6:09 PM, pradeep s <[email protected]>
wrote:

> Thanks John. Ours is almost exactly the same use case. We thought of using a
> Redis cache as the storage layer, but since we can get bulk updates from the
> source, the cache value size can become pretty large. For getting the total
> count of messages in a transaction, we are keeping a counter against the
> transaction id key, and this is sent as part of the last message in the
> transaction. There is an islast flag available from GoldenGate, but it can
> arrive in any order in Storm, so we are using the count that comes in the
> last message from GoldenGate.
> In the database bolt we can continue to hold the messages in a store until
> we get the total count.
> The main challenge is maintaining and updating the count in multithreaded
> bolt processing, and storing sometimes a GB worth of messages in the case of
> bulk updates. Any suggestions?
> On Apr 14, 2016 9:22 AM, "John Bush" <[email protected]> wrote:
>
>> We do something kinda similar.  I think you will need another store to
>> keep track of these; I'm not sure about Storm's distributed cache. We use
>> Cassandra, but you could use ZooKeeper or some other store.  The
>> issue is that since rows come out of Storm in no guaranteed order, you
>> really don't know when you are done.  You need to know when you are
>> complete in order to remove the messages from your source (or
>> otherwise update things over there).
>>
>> So what we do is keep track of how many rows are on the read side (in
>> some store).  Then as we process, we update on the write side how many
>> rows we wrote.  Then by checking this count, how many we updated vs.
>> how many we expected in total, we know when we are done.  It sounds
>> like your situation might be more complicated than ours if you are
>> talking about rows from many different tables all inside the same
>> transaction, but in any event some pattern like this should work.
>>
>> For perspective, we essentially ETL data out of hundreds of tables like
>> this into Cassandra, and it works quite well.  You just need to be
>> super careful with the completion logic; there are many edge cases to
>> consider.
>>
>> On Thu, Apr 14, 2016 at 9:00 AM, Nikos R. Katsipoulakis
>> <[email protected]> wrote:
>> > Hello Sreekumar,
>> >
>> > Have you thought of using Storm's distributed cache? If not, that might
>> > be a way to cache messages before you push them to the target DB. Another
>> > way to do it is to create your own bolt that periodically pushes messages
>> > into the database.
>> >
>> > I hope I helped.
>> >
>> > Cheers,
>> > Nikos
>> >
>> > On Thu, Apr 14, 2016 at 12:54 AM, pradeep s <[email protected]> wrote:
>> >>
>> >> Hi,
>> >> We are using Storm for processing CDC messages from Oracle GoldenGate.
>> >> The pipeline is as below:
>> >> Oracle GoldenGate --> Queue --> Storm --> Relational DB
>> >> We have a requirement to hold the messages for a transaction id until
>> >> all the messages for that transaction are available in Storm. There can
>> >> be scenarios like 1 million updates happening in one transaction in the
>> >> source Oracle system.
>> >> Can you please suggest the best approach for holding the messages and
>> >> then pushing to the target db only when all messages for a transaction
>> >> id are available in Storm.
>> >>
>> >> Regards
>> >> Pradeep S
>> >
>> >
>> >
>> >
>> > --
>> > Nikos R. Katsipoulakis,
>> > Department of Computer Science
>> > University of Pittsburgh
>>
>>
>>
>> --
>>
>> John Bush
>> Trax Technologies, Inc.
>> M: 480-227-2910
>> TraxTech.Com
>>
>
