Hi all,

This is a bit of a read, but I was hoping I could get some opinions on this from experienced Storm users.
I have developed a prototype for a distributed, fault-tolerant text processing system in Storm; however, given my supervisor’s requirements, I’m not sure Storm is a good fit anymore. Here are the most relevant requirements:

- Text documents are processed through a directed acyclic graph of services. These services apply (mostly) natural language processing algorithms to the documents.
- Text documents can be streamed one document at a time, or processed in large batches.
- The user may choose to apply *arbitrary* services to documents in *arbitrary* input order, so long as the resultant processing graph is acyclic. The processing graph may be changed per document at the user’s whim, via a web interface. This graph of service application is called a *workflow*. At first glance it seems very similar to a topology, but it is dynamic and can be changed at runtime, for each document.
- The services are disparate and unruly Java libraries which may be *deployed and disabled dynamically* by a system administrator without causing any downtime for the rest of the services.
- The system administrator wants to choose exactly which nodes run which services. The reason has not been explicitly specified.
- Results of the services are persisted with the original document text in a document-oriented database.
- The system administrator wants any unique piece of text to be processed *only once*: any time a service is about to process arguments that it has already processed, it should skip this stage of processing and simply load the results cached in the database.
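To make the workflow and *only once* requirements concrete, here is a rough sketch of the behaviour I am after for a single document, outside of Storm entirely. It is only an illustration under assumptions: the names `run_workflow` and `cache_key` are made up, a plain dict stands in for the document-oriented database, every service takes and returns strings, and it needs Python 3.9+ for `graphlib`.

```python
import hashlib
from graphlib import TopologicalSorter, CycleError


def cache_key(service_name, inputs):
    """Key a result by the service and the exact arguments it receives."""
    h = hashlib.sha256()
    h.update(service_name.encode("utf-8"))
    for arg in inputs:
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
        h.update(arg.encode("utf-8"))
    return h.hexdigest()


def run_workflow(text, workflow, services, cache):
    """Run one document through its own workflow.

    workflow: dict of service name -> list of upstream service names
              (an empty list means the service reads the raw document text).
    services: dict of service name -> callable taking a list of strings
              and returning a string (hypothetical interface).
    cache:    dict standing in for the document database.
    """
    try:
        # Upstream services come first; a cycle raises CycleError.
        order = list(TopologicalSorter(workflow).static_order())
    except CycleError as err:
        raise ValueError(f"workflow is not acyclic: {err.args[1]}") from err

    results = {}
    for name in order:
        upstream = workflow.get(name, [])
        inputs = [results[u] for u in upstream] if upstream else [text]
        key = cache_key(name, inputs)
        if key not in cache:
            # Only run the service for arguments it has never seen.
            cache[key] = services[name](inputs)
        results[name] = cache[key]
    return results
```

Calling `run_workflow` twice with the same text, workflow, and cache should invoke each service only on the first call; the second call is served from the cache. The part that does not map cleanly onto a Storm topology is that `workflow` is an argument here, so it can differ for every document.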
------------------------------

I have tried many *terrible* things to get this thing running in Storm, including: writing my own scheduler that deploys services to specific nodes *based on the bolt name* (we all make mistakes); loading services dynamically in the prepare() method for bolts; dynamically generating topologies from a YAML or JSON config file (though this definitely does not solve my problem of having *dynamic* topologies); and partially (and poorly) reimplementing Trident – accidentally – to support joins for certain n-ary services.

The result after all of this effort is unsatisfactory. As well, this entire project is being written, tested, and deployed by just one person, so the complexity involved is getting a bit overwhelming.

I’d like to know if I’m forcing Storm to do things it’s not apt to do; initially, Storm sounded perfect for our application, but now I’m having my doubts. Any opinions on this?

Thanks,
Eddie
