Hi all,

This is a bit of a read, but I was hoping I could get some opinions on this
from experienced Storm users.

I have developed a prototype for a distributed, fault-tolerant text
processing system in Storm; however, given my supervisor’s requirements,
I’m no longer sure that Storm is a good fit.

Here are the most relevant requirements:

   - Text documents are processed through a directed acyclic graph of
   services. These services apply (mostly) natural language processing
   algorithms to the documents.
   - Text documents can be streamed one document at a time, or processed
   in large batches.
   - The user may choose to apply *arbitrary* services to documents in
   *arbitrary* order, so long as the resulting processing graph is
   acyclic. The processing graph may be changed per document at the user’s
   whim, via a web interface. This graph of service applications is called
   a *workflow*. At first glance it seems very similar to a topology, but
   it is dynamic and can be changed at runtime for each document.
   - The services are disparate and unruly Java libraries which may be
   *deployed and disabled dynamically* by a system administrator without
   causing any downtime for the rest of the services.
   - The system administrator wants to choose exactly which nodes run which
   services. The reason why has not been explicitly specified.
   - Results of the services are persisted with the original document text
   in a document-oriented database.
   - The system administrator wants any unique piece of text to be
   processed *only once* – and any time a service is about to process
   arguments that it has already processed, it should skip this stage of
   processing and simply load the results cached in the database.
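
To make the workflow and process-only-once requirements concrete, here is
a minimal sketch in plain Java. It is deliberately not Storm code, and
everything in it is hypothetical: the WorkflowRunner class, the in-memory
map standing in for the document database, and the naive cache key are
assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: runs one document through a per-document workflow
// (a DAG of named services) and skips any call whose inputs were seen before.
public class WorkflowRunner {
    // Stand-in for the document-oriented database: cache key -> result.
    private final Map<String, String> resultCache = new HashMap<>();
    // Currently deployed services: name -> function from inputs to a result.
    private final Map<String, Function<List<String>, String>> services = new HashMap<>();
    // Number of real (non-cached) service invocations so far.
    private int invocations = 0;

    public void register(String name, Function<List<String>, String> service) {
        services.put(name, service);
    }

    public int invocations() {
        return invocations;
    }

    // workflow maps each service name to the services it depends on; an empty
    // list means the service reads the raw document text.
    public Map<String, String> run(String text, Map<String, List<String>> workflow) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : workflow.keySet()) {
            evaluate(name, text, workflow, results, new ArrayList<>());
        }
        return results;
    }

    private String evaluate(String name, String text, Map<String, List<String>> workflow,
                            Map<String, String> results, List<String> path) {
        if (results.containsKey(name)) {
            return results.get(name); // already computed for this document
        }
        if (path.contains(name)) {
            throw new IllegalArgumentException("workflow has a cycle at " + name);
        }
        Function<List<String>, String> service = services.get(name);
        if (service == null) {
            throw new IllegalStateException("service not deployed: " + name);
        }
        path.add(name);
        List<String> deps = workflow.getOrDefault(name, List.of());
        List<String> inputs = new ArrayList<>();
        if (deps.isEmpty()) {
            inputs.add(text);
        }
        for (String dep : deps) {
            inputs.add(evaluate(dep, text, workflow, results, path));
        }
        path.remove(path.size() - 1);

        // Naive cache key (service name plus its arguments); a real system
        // would more likely hash the arguments.
        String key = name + "|" + inputs;
        String output = resultCache.get(key);
        if (output == null) {
            output = service.apply(inputs);
            invocations++;
            resultCache.put(key, output);
        }
        results.put(name, output);
        return output;
    }
}
```

Because the workflow is just a map passed in with each document, it can
differ from one document to the next, which is exactly the part that a
static topology does not give me.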

------------------------------

I have tried many *terrible* things to get this thing running in Storm,
including: writing my own scheduler that deploys services to specific
nodes *based on the bolt name* (we all make mistakes); loading services
dynamically in
the prepare() method for bolts; dynamically generating topologies from a
YAML or JSON config file (though this definitely does not solve my problem
of having *dynamic* topologies); and partially (and poorly) reimplementing
Trident – accidentally – to support joins for certain n-ary services.

The result after all of this effort is unsatisfactory. On top of that, this
entire project is being written, tested, and deployed by just one person,
so the complexity involved is getting a bit overwhelming.

I’d like to know whether I’m forcing Storm to do things it isn’t suited
for; initially, Storm sounded perfect for our application, but now I’m
having my doubts. Any opinions on this?

Thanks,
Eddie
