Hello Andrea,

Please check my inline answers below. However, I think it is not the
topology that is puzzling you (since you have already defined the workflow
in steps), but rather the semantics of the data involved. To be more
precise, you seem to need some state maintained on different bolts. You
have to define how often the state is updated, where it is stored, whether
it is window-based or historically accumulated, etc. Also, if you manage to
have your operators work in a stateless way (applying a function to each
input tuple), then the challenging part would be to minimize any I/O (e.g.
contacting an external storage) and the processing cost. I hope that you
will find my email useful.


On Mon, Nov 30, 2015 at 11:42 AM, Kalogeropoulos, Andreas <
[email protected]> wrote:

> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
For this part, you have to consider the semantics of your processing. For
instance, does the processing involve any state maintenance? If not, it is
simply a "filtering" bolt, so you can be really flexible in tuning its
performance. In fact, you can start with an initial parallelism hint
(number of threads executing the filtering mechanism) and then scale up or
down according to the actual performance at runtime (the capacity reached
by those bolts).

> 2.       Add additional information (based on sender email)
>
This part looks like it is going to perform I/O in order to get more
information (right?). If yes, you need to consider different ways to
engineer how you retrieve this data. If not, and you get the additional
information from the actual mail, then again you can apply the same idea as
in Step 1.
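
If the enrichment does hit an external store, one common way to cut the
I/O is to cache lookups per sender inside the bolt. Below is a minimal
sketch in plain Java, not tied to any Storm API. The class name
SenderInfoCache, the loader function and the cache size are all
hypothetical; it also assumes each bolt instance is called from a single
thread, so no locking is done.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: caches per-sender lookups to avoid repeated external I/O. */
class SenderInfoCache {
    // Stands in for the real call to the external store (assumption).
    private final Function<String, String> loader;
    private final Map<String, String> cache;
    private int loads = 0;

    SenderInfoCache(final int maxEntries, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder=true keeps entries in least-recently-used order,
        // so the eldest entry is the one to evict when the cache is full.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns cached info for the sender, loading it once on a miss. */
    String lookup(String sender) {
        String info = cache.get(sender);
        if (info == null) {
            info = loader.apply(sender); // the only place that does I/O
            loads++;
            cache.put(sender, info);
        }
        return info;
    }

    /** Number of times the external store was actually contacted. */
    int externalLoads() {
        return loads;
    }
}
```

The enrichment step would call lookup(sender) for each tuple. How large
the cache should be, and how stale an entry may get, depends on your data.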

> 3.       Create an XML based on this data, to inject in another solution
>
This part is tricky because it is not clear to me whether those XMLs
contain aggregated information, or whether they are built separately based
on the input that each bolt receives. In the former case, you will need to
engineer your desired aggregate operations based on your application
semantics. In the latter, each bolt can produce its XML based on the input
it received in a logical window of operation (either purely time-based or
tuple-based).
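
To make the window idea concrete, here is a rough sketch in plain Java of
a buffer that closes a window either after a number of tuples or after
some time has passed, and then emits one XML document. The class
XmlBatcher, the element names and the two thresholds are assumptions for
illustration only, not something Storm provides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer: collects parsed mails and emits one XML per window. */
class XmlBatcher {
    private final int maxTuples;       // tuple-based limit
    private final long maxAgeMillis;   // time-based limit
    private final List<String[]> buffer = new ArrayList<>();
    private long windowStart = -1;

    XmlBatcher(int maxTuples, long maxAgeMillis) {
        this.maxTuples = maxTuples;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Adds one mail; returns an XML document when the window closes, else null. */
    String add(String from, String subject, long nowMillis) {
        if (buffer.isEmpty()) {
            windowStart = nowMillis; // first mail opens a new window
        }
        buffer.add(new String[] {from, subject});
        boolean full = buffer.size() >= maxTuples;
        boolean expired = nowMillis - windowStart >= maxAgeMillis;
        return (full || expired) ? flush() : null;
    }

    /** Builds the XML for everything buffered so far and empties the buffer. */
    String flush() {
        StringBuilder xml = new StringBuilder();
        xml.append("<emails count=\"").append(buffer.size()).append("\">");
        for (String[] mail : buffer) {
            xml.append("<email><from>").append(escape(mail[0]))
               .append("</from><subject>").append(escape(mail[1]))
               .append("</subject></email>");
        }
        xml.append("</emails>");
        buffer.clear();
        return xml.toString();
    }

    // Minimal escaping of XML special characters in element text.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With maxTuples set to 10000, the parsing and enrichment bolts can stay
small and highly parallel while a single aggregating bolt holds this
buffer. Note that the time check here only runs when a mail arrives; in a
real topology you would probably also flush on a timer (Storm's tick
tuples are one option).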

>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, if I have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>



-- 
Nick R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh
