Is Storm suitable for processing large amounts of data?

churly lin Tue, 28 Jan 2014 01:45:48 -0800

Hi all,
We have a problem in a recent project: one bolt needs to do a complicated
calculation. The logic is as follows.
First, we have a very large amount of sample data in HDFS. The data is about
1000 GB, and each line in the sample data is a record (a string), like:


    This is first sentence
    This is another sentence
    ...
    The last sentence

What we want is this: every time the spout reads a record from the message
queue and emits it, the bolt finds the 1000 records in the 1000 GB sample
data that are most similar to it. The similarity is simply the number of
words that differ between the two string records. We have no idea how to
implement this with Storm. Is Storm even suitable for this? We really want
it to be a real-time stream process.
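
To make the similarity and the top-1000 selection concrete, here is a minimal
Python sketch. It rests on assumptions that are not settled above: words are
split on whitespace, and "different words" is read as the number of words that
occur in one record but not in the other, counting repeats. The names
word_difference and top_k_similar are made up for illustration. It is a plain
single-pass scan that only pins down the definition; it does not address how
to make a scan over 1000 GB fast enough for real-time use.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def word_difference(a: str, b: str) -> int:
    """Count words that occur in one record but not the other (with repeats).

    Assumption: whitespace tokenization, case-sensitive comparison.
    """
    ca, cb = Counter(a.split()), Counter(b.split())
    # Counter subtraction keeps only positive counts, so the two sums
    # together cover the words missing on either side.
    return sum((ca - cb).values()) + sum((cb - ca).values())


def top_k_similar(query: str, samples: Iterable[str],
                  k: int = 1000) -> List[Tuple[int, str]]:
    """Return the k samples with the fewest differing words, best first.

    heapq.nsmallest consumes the generator in one pass, so only about k
    (score, record) pairs are held in memory at a time.
    """
    scored = ((word_difference(query, s), s) for s in samples)
    return heapq.nsmallest(k, scored)
```

For example, top_k_similar("This is first sentence", lines, k=2) over the three
sample lines shown earlier plus the query itself would rank the identical
record first (0 differing words) and "This is another sentence" second
(2 differing words).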

P.S.
We have come up with one idea: implement it with the help of a Hadoop
map/reduce job. Every time the bolt reads a tuple from the spout, it submits
the tuple to the Hadoop cluster to do the calculation (we know how to
implement this with map/reduce, but it is not a real-time stream process),
then reads the result of the Hadoop job.
But how do we execute a map/reduce job from within Storm?

Thanks.
