Hi Albert,

Since your latency requirement is around 1-2 minutes, Spark Streaming should be a good solution. You may also want to check whether streaming and processing the documents in Flume and writing the results out to HDFS would suffice.
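Something along these lines could be a starting point. This is only a rough sketch: the 60-second batch interval, the socket source, the host/port and the HDFS path are placeholders I made up; in practice the input would come from Kafka/Flume or a custom receiver.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DocPipelineSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("doc-pipeline")
    // ~1 minute batches keep end-to-end latency in your 1-2m target range
    val ssc = new StreamingContext(conf, Seconds(60))

    // Placeholder source: documents arriving as lines of text on a socket.
    // You would swap this for Kafka/Flume or a custom receiver.
    val docs = ssc.socketTextStream("crawler-host", 9999)

    // Placeholder processing step
    val processed = docs.map(_.toLowerCase)

    // Each batch is written out as a set of files under this HDFS prefix
    processed.saveAsTextFiles("hdfs:///pipeline/out/docs")

    ssc.start()
    ssc.awaitTermination()
  }
}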
> crawling, keyword extraction, language detection, indexation
> On each step we add additional data to the document, for example on the
> language extraction, we begin with a document without language, and we
> output the document with a new language field

Can all the computations for a document be done in a single map function
(there's a rough sketch at the bottom of this mail)? Creating fewer
intermediate objects should help improve performance.

Thanks,
Jayant

On Thu, Oct 23, 2014 at 4:56 AM, Albert Vila <albert.v...@augure.com> wrote:

> Hi Jayant,
>
> On 23 October 2014 11:14, Jayant Shekhar <jay...@cloudera.com> wrote:
>
>> Hi Albert,
>>
>> Have a couple of questions:
>>
>> - You mentioned near real-time. What exactly is your SLA for processing
>>   each document?
>
> The lower the better :). Right now it's between 30s and 5m, but I would
> like to have something stable around 1-2m if possible, taking into
> account that the system should be able to scale to 50M - 100M documents.
>
>> - Which crawler are you using, and are you looking to bring Hadoop into
>>   your overall workflow? You might want to read up on how network
>>   traffic is minimized/managed on the Hadoop cluster, since you ran
>>   into network issues with your current architecture.
>
> Everything is developed by us. The network issues were not related to
> the crawler itself; they were related to the documents we were moving
> around the system to be processed at each workflow stage. And yes, we
> are currently researching whether we can introduce Spark Streaming to be
> able to scale and execute all workflow stages, and use HDFS/Cassandra to
> store the data.
>
> Should we use the DStream persist function (if we treat every document
> as an RDD) in order to reuse the same data, or is it better to create
> new DStreams? On each step we add additional data to the document; for
> example, in language extraction we begin with a document without a
> language and we output the document with a new language field.
>
> Thanks
>
>> Thanks!
>>
>> On Thu, Oct 23, 2014 at 12:07 AM, Albert Vila <albert.v...@augure.com>
>> wrote:
>>
>>> Hi
>>>
>>> I'm evaluating Spark Streaming to see if it fits to scale our current
>>> architecture.
>>>
>>> We are currently downloading and processing 6M documents per day from
>>> online and social media. We have a different workflow for each type
>>> of document, but some of the steps are keyword extraction, language
>>> detection, clustering, classification, indexation, ... We are using
>>> Gearman to dispatch the jobs to workers, and we have some queues on a
>>> database. Everything is in near real time.
>>>
>>> I'm wondering if we could integrate Spark Streaming into the current
>>> workflow and whether it's feasible. One of our main discussions is
>>> whether we should go to a fully distributed architecture or to a
>>> semi-distributed one. I mean, distribute everything or process some
>>> steps on the same machine (crawling, keyword extraction, language
>>> detection, indexation). We don't know which one scales better; each
>>> one has pros and cons.
>>>
>>> Now we have a semi-distributed one, as we had network problems given
>>> the amount of data we were moving around. So now, all documents
>>> crawled on server X are later dispatched through Gearman to the same
>>> server. What we dispatch through Gearman is only the document id, and
>>> the document data remains on the crawling server in a Memcached, so
>>> the network traffic is kept to a minimum.
>>>
>>> Is it feasible to remove all the database queues and Gearman and move
>>> to Spark Streaming? We are evaluating adding Kafka to the system too.
>>> Is anyone using Spark Streaming for a system like ours?
>>> Should we worry about the network traffic, or is that something Spark
>>> can manage without problems? Every document is around 50 KB (roughly
>>> 300 GB a day).
>>> If we wanted to isolate some steps to be processed on the same
>>> machine(s) (or give them priority), is that something we could do
>>> with Spark?
>>>
>>> Any help or comments will be appreciated. And if someone has had a
>>> similar problem and has knowledge about the architecture approach, it
>>> will be more than welcome.
>>>
>>> Thanks
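PS: here is a rough sketch of the single-map idea mentioned above. The Document fields and the detectLanguage/extractKeywords helpers are placeholders I made up, not your actual code; the point is that one map call enriches the document in a single pass instead of creating a new DStream per workflow stage.

import org.apache.spark.streaming.dstream.DStream

object EnrichmentSketch {
  case class Document(id: String, text: String,
                      language: Option[String] = None,
                      keywords: Seq[String] = Nil)

  // Stand-ins for the real language detection and keyword extraction steps
  def detectLanguage(text: String): String = "en"
  def extractKeywords(text: String): Seq[String] = Seq("example")

  // docs would be a DStream[Document] built from whatever source feeds the
  // pipeline (Kafka, Flume, a custom receiver, ...)
  def enrich(docs: DStream[Document]): DStream[Document] =
    docs.map { d =>
      // one pass over the document, one new object for all enrichment steps
      d.copy(language = Some(detectLanguage(d.text)),
             keywords = extractKeywords(d.text))
    }
}

Calling persist() on the enriched stream only pays off if the same DStream is consumed by more than one downstream operation (e.g. both indexing and classification); if each stage simply feeds the next, chaining the transformations is enough.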