Re: Crawler and Scraper with different priorities
Hi Sandeep,

Would you be interested in joining my open source project? https://github.com/tribbloid/spookystuff

IMHO Spark is indeed not for general-purpose crawling, where distributed jobs are highly homogeneous. But it is good enough for directional scraping, which involves heterogeneous input and deep graph following & extraction. Please drop me a line if you have a use case, and I'll try to integrate it as a feature.

Yours,
Peng

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-Scraper-with-different-priorities-tp13645p13838.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
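The "deep graph following & extraction" pattern mentioned above — starting from seed pages, following links down to a bounded depth, and extracting fields along the way — can be sketched in plain Python. This is only an illustration of the traversal logic: the in-memory `PAGES` map stands in for a real HTTP fetch plus HTML parse, and all page names and titles here are made up.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> (extracted title, outgoing links).
# In a real scraper these would come from an HTTP fetch + HTML parse.
PAGES = {
    "seed": ("Seed page", ["a", "b"]),
    "a":    ("Page A",    ["c"]),
    "b":    ("Page B",    []),
    "c":    ("Page C",    ["a"]),  # cycle back to "a"
}

def scrape(seeds, max_depth):
    """Breadth-first link following with a depth bound and a visited set."""
    visited, results = set(), []
    queue = deque((s, 0) for s in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        title, links = PAGES[url]
        results.append((url, title))            # "extraction" step
        if depth < max_depth:                   # "deep graph following" step
            queue.extend((link, depth + 1) for link in links)
    return results

print(scrape(["seed"], max_depth=1))
```

The visited set makes the traversal terminate even though the link graph contains a cycle; the depth bound is what makes the crawl "directional" rather than exhaustive.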
Re: Crawler and Scraper with different priorities
Hi Daniil,

I have to do some processing of the results, as well as push the data to the front end. Currently I'm using Akka for this application, but I was thinking Spark Streaming might be a better fit, and I could also use MLlib for processing the results. Any specific reasons why Spark Streaming wouldn't be better than Akka?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-Scraper-with-different-priorities-tp13645p13763.html
Re: Crawler and Scraper with different priorities
Depending on what you want to do with the results of the scraping, Spark may not be the best framework for your use case. Take a look at a general Akka application.

On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh wrote:
> Hi all,
>
> I am implementing a crawler and scraper. It should be able to process a
> request for crawling & scraping within a few seconds of submitting the
> job (around 1 mil/sec); the rest can take some time (scheduled evenly
> throughout the day). What is the best way to implement this?
>
> Thanks.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-and-Scraper-with-different-priorities-tp13645.html
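One way to read the requirement in the quoted question is a two-tier scheduler: urgent requests are served within seconds, while the bulk is spread evenly over the day. A minimal sketch with Python's stdlib `heapq` is below; the priority values, class name, and URLs are illustrative assumptions, not something from the thread, and a real system would run workers against this queue concurrently.

```python
import heapq
import itertools

URGENT, BULK = 0, 1  # lower number is served first

class CrawlScheduler:
    """Priority queue of crawl requests; FIFO within the same priority."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves insertion order

    def submit(self, url, priority=BULK):
        heapq.heappush(self._heap, (priority, next(self._seq), url))

    def next_job(self):
        _priority, _, url = heapq.heappop(self._heap)
        return url

sched = CrawlScheduler()
sched.submit("http://example.com/daily-1")        # bulk, spread over the day
sched.submit("http://example.com/hot", URGENT)    # must run within seconds
sched.submit("http://example.com/daily-2")
print(sched.next_job())  # the urgent job jumps the queue
```

The monotonically increasing sequence number avoids comparing URLs when two jobs share a priority, and keeps same-priority jobs in submission order; with an Akka design, the analogous piece would be a priority mailbox in front of the crawler actors.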