Yep, sounds like you've got it. You'd want to use fields grouping and group on a field that contains the hash. Then every tuple whose field holds the same hash gets sent to the same bolt instance.
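To illustrate why grouping on the hash keeps duplicates together, here is a minimal standalone sketch. It is plain Java with no Storm dependency and is not Storm's actual implementation; it only shows the idea that the target task is a pure function of the grouping value. The class name, the sample hashes, and the component and field names in the trailing comment are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch (no Storm dependency): the target task is derived only
// from the grouping field's value, so equal hashes always land on the same task.
final class HashRoutingSketch {

    // Pick a task index in [0, numTasks) from the grouping value.
    static int taskFor(String attachmentHash, int numTasks) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(attachmentHash.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // hypothetical parallelism of the XML bolt
        List<String> hashes = Arrays.asList("a1b2", "ffee", "a1b2");
        for (String h : hashes) {
            System.out.println(h + " -> task " + taskFor(h, numTasks));
        }
        // In a real topology the wiring would look roughly like this
        // (component and field names are hypothetical):
        //   builder.setBolt("xml-bolt", new XmlBolt(), numTasks)
        //          .fieldsGrouping("parse-bolt", new Fields("attachmentHash"));
    }
}
```

Note that this assumes one hash per tuple. An email with several attachments would probably need the parsing bolt to emit one tuple per attachment so that each hash can be routed on its own.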
On Tue, Dec 1, 2015 at 8:23 PM, Kalogeropoulos, Andreas <[email protected]> wrote:

> Making sure that duplicates make it in the same XML file (third bolt).
>
> Kind Regards,
> *Andréas Kalogéropoulos*
>
> *From:* Stephen Powis [mailto:[email protected]]
> *Sent:* Tuesday, December 01, 2015 11:59 AM
> *To:* [email protected]
> *Subject:* Re: Using Storm to parse emails and creates batches
>
> So you want to eliminate duplicates or make sure that duplicates make it
> into the same XML file (third bolt)?
>
> On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <[email protected]> wrote:
>
> Hello Stephen,
>
> Imagine that the spout is providing me 300 000 emails per hour.
> The first bolt will parse/analyze the information (from, to, cc, subject,
> object, date, hash of attachments, …) and will probably find the same hash
> for some attachments (someone forwarding an email).
>
> The last bolt will create an XML based on all this information, but if I
> can have the tuples containing the same attachment (based on hash) in the
> same XML, I can actually apply a dedup logic: having multiple lines in my
> XML pointing to the same file.
>
> Does this make more sense?
>
> Kind Regards,
> *Andréas Kalogéropoulos*
>
> *From:* Stephen Powis [mailto:[email protected]]
> *Sent:* Tuesday, December 01, 2015 11:36 AM
> *To:* [email protected]
> *Subject:* Re: Using Storm to parse emails and creates batches
>
> I'm not sure I follow/understand your question or what you're trying to do.
>
> On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <[email protected]> wrote:
>
> You are right. Sorry for making you state the obvious :).
>
> Last question: if my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this?
> My instinct says to try to work with Fields grouping and the correct key
> (probably conversation, since I am working with emails).
>
> Kind Regards,
> *Andréas Kalogéropoulos*
>
> *From:* Stephen Powis [mailto:[email protected]]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
> *To:* [email protected]
> *Subject:* Re: Using Storm to parse emails and creates batches
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory, i.e. List<Tuple> myListOfTuples = new ArrayList<Tuple>(); If the
> third bolt crashes and loses its in-memory collection, then after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will time out and be replayed through your entire Storm
> topology, and your collection will be repopulated.
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <[email protected]> wrote:
>
> Hello Stephen,
>
> I think you got it correctly. Thanks a lot for the idea.
> If you have seen limitations, please send the disclaimers :). For
> example, how did you handle persistence of this collection? If the third
> bolt failed while populating the collection (size and time have not been
> reached), we just lost everything, so I need to have a status loopback of
> what was really output. Right?
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I'll be grateful.
> In all cases, thanks a lot for your help; even without the code, you
> really give me example advice, and now I can start building something.
>
> Kind Regards,
> *Andréas Kalogéropoulos*
>
> *From:* Stephen Powis [mailto:[email protected]]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* [email protected]
> *Subject:* Re: Using Storm to parse emails and creates batches
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single XML for them. We've done
> this by essentially doing the following:
>
> Bolt 3 has a collection of tuples. As a tuple comes in, we add it to the
> collection and check the size of the collection. Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time. If
> the collection size > N OR if we've waited more than X seconds, process the
> batch. This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where, say, 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time. If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts). To handle this, we made use of tick
> tuples. Tick tuples essentially allow you to send a special tuple
> to your bolt every Y seconds. We use that to make sure the time
> constraint is checked on a regular basis (for example, by sending a tick
> tuple every 1, 5, or 10 seconds).
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <[email protected]> wrote:
>
> Hello,
>
> I want to use Storm to do three things:
>
> 1. Parse email data (from / to / cc / subject) from an incoming SMTP source
> 2. Add additional information (based on sender email)
> 3. Create an XML based on this data, to inject into another solution
>
> The only issue: I want steps 1 and 2 to be as fast as possible, so creating
> as many bolts/tasks as possible, but I want the XML to be as big as
> possible, so gathering the output of multiple bolts.
>
> In this logic, if I have 100 mails per second in the original input, I
> would want steps 1 and 2 to each work on the smallest number of emails, to
> do it faster.
> But I still want to be able to have an XML that represents 10 000+ emails
> at the end.
>
> I can't think of a topology to address this.
> Can someone give me some pointers to the best way to handle this?
>
> Kind Regards,
> *Andréas Kalogéropoulos*
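For reference, the size-or-time batching described in the quoted thread (collect tuples, flush when the collection is big enough or old enough, and re-check on every tick) could be sketched roughly as below. This is a standalone illustration in plain Java, not the code from the thread: the class name, method names, and thresholds are hypothetical, time is passed in explicitly to keep the logic easy to test, and acking the flushed tuples is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of size-or-time batching (no Storm dependency).
// In a bolt, add() would be called from execute() for normal tuples and
// flushIfDue() for tick tuples; the caller would ack the returned batch
// after processing it.
final class BatchBufferSketch<T> {
    private final int maxSize;
    private final long maxAgeMillis;
    private final List<T> pending = new ArrayList<>();
    private long oldestMillis = -1; // arrival time of the first pending item

    BatchBufferSketch(int maxSize, long maxAgeMillis) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Add an item; returns the batch to process if a threshold was hit,
    // otherwise an empty list.
    List<T> add(T item, long nowMillis) {
        if (pending.isEmpty()) {
            oldestMillis = nowMillis;
        }
        pending.add(item);
        return flushIfDue(nowMillis);
    }

    // Also meant to be called on every tick, so a quiet stream still flushes.
    List<T> flushIfDue(long nowMillis) {
        boolean full = pending.size() >= maxSize;
        boolean stale = !pending.isEmpty() && nowMillis - oldestMillis >= maxAgeMillis;
        if (!full && !stale) {
            return new ArrayList<>();
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }
}
```

In an actual bolt, tick tuples are typically enabled by setting Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS in the map returned from getComponentConfiguration(), and the maximum age should stay well below Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS so held tuples are acked before they time out.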
