Re: Crawler and Scraper with different priorities

2014-09-09 Thread Peng Cheng
Hi Sandeep,

would you be interesting in joining my open source project?

https://github.com/tribbloid/spookystuff

IMHO spark is indeed not for general purpose crawling, of which distributed
job is highly homogeneous. But good enough for directional scraping which
involves heterogeneous input and deep graph following & extraction. Please
drop me a line if you have a user case, as I'll try to integrate it as a
feature.

Yours Peng



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-Scraper-with-different-priorities-tp13645p13838.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Crawler and Scraper with different priorities

2014-09-08 Thread Sandeep Singh
Hi Daniil,

I have to do some processing of the results, as well as pushing the data to
the front end. Currently I'm using akka for this application, but I was
thinking maybe spark streaming would be a better thing to do. as well as i
can use mllib for processing the results. Any specific reason's why spark
streaming won't be better than akka ?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-Scraper-with-different-priorities-tp13645p13763.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Crawler and Scraper with different priorities

2014-09-08 Thread Daniil Osipov
Depending on what you want to do with the result of the scraping, Spark may
not be the best framework for your use case. Take a look at a general Akka
application.

On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh 
wrote:

> Hi all,
>
> I am Implementing a Crawler, Scraper. The It should be able to process the
> request for crawling & scraping, within few seconds of submitting the
> job(around 1mil/sec), for rest I can take some time(scheduled evenly all
> over the day). What is the best way to implement this?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-and-Scraper-with-different-priorities-tp13645.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>