One approach would be to add an extra field to each tuple carrying the
taskId, then track all the tuples via the ack and fail methods on the
spout. (The job is done when # of acked + # of failed records = the total
for the job.)
The total should live somewhere outside of Storm, in a fast data store;
that way you don't have to create a new topology for each task.
This approach has disadvantages - Storm has to track tuples (affecting
throughput) and you need a separate data store - but that may be no
problem in your use case.

This approach should work if you know how many records you expect to
process per job.
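
Roughly, the spout side could look like the sketch below (written against
the old backtype.storm API; JobAwareSpout, MsgId and expectedTotal are
made-up names, and the in-memory counter here stands in for the external
store, where it really belongs so counts survive worker restarts):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class JobAwareSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        // acked + failed counts per taskId; in memory for brevity only
        private final ConcurrentHashMap<String, AtomicLong> completed =
                new ConcurrentHashMap<String, AtomicLong>();

        @Override
        public void open(Map conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // emit each record with the taskId as an extra field, and use
            // a message id that carries the taskId so ack/fail can see it:
            // collector.emit(new Values(record, taskId),
            //                new MsgId(taskId, recordId));
        }

        @Override
        public void ack(Object msgId) {
            recordDone(((MsgId) msgId).taskId);
        }

        @Override
        public void fail(Object msgId) {
            // a failed tuple still counts toward completion in this scheme
            recordDone(((MsgId) msgId).taskId);
        }

        private void recordDone(String taskId) {
            AtomicLong counter = completed.get(taskId);
            if (counter == null) {
                AtomicLong fresh = new AtomicLong();
                counter = completed.putIfAbsent(taskId, fresh);
                if (counter == null) {
                    counter = fresh;
                }
            }
            if (counter.incrementAndGet() == expectedTotal(taskId)) {
                // job finished: mark it done in the external store / notify
            }
        }

        private long expectedTotal(String taskId) {
            // placeholder: read the per-job total from the external store
            return Long.MAX_VALUE;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("record", "taskId"));
        }

        private static class MsgId {
            final String taskId;
            final Object recordId;
            MsgId(String taskId, Object recordId) {
                this.taskId = taskId;
                this.recordId = recordId;
            }
        }
    }

Whoever submitted the job can then poll (or be notified through) the
external store instead of deploying a topology per task.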

On Wed, Mar 19, 2014 at 5:38 PM, Eugene Dzhurinsky <[email protected]> wrote:

> On Wed, Mar 19, 2014 at 05:14:28PM -0400, P. Taylor Goetz wrote:
> > Have you considered using DRPC [1]?
> >
> > [1] http://storm.incubator.apache.org/documentation/Distributed-RPC.html
>
> From what I've read so far - it seems to create and deploy a topology,
> which I
> could also do here (and probably much simpler, since it's a concrete case -
> not a generic one).
>
> So if a creation of a new topology per "task" is not considered as a "bad
> practice" - the only thing which I can't get is how to track that a task
> has
> actually finished and all data from the spout was received in sink.
> Perhaps I
> could add some sort of "record number" (which I may know in advance) and
> process the completion event if the sink has received the expected number
> of
> records.
>
> --
> Eugene N Dzhurinsky
>



-- 
Eugene Dvorkin
Software Engineer
New York City Storm User Group - organizer
WebMD
email: [email protected]
phone: 646-250-9649
eugenedvorkin.com
Connect with me on:
LinkedIn <http://www.linkedin.com/in/eugenedvorkin>
Twitter <http://twitter.com/edvorkin>
