Re: Deploy a Nutch crawler or use Webhose.io?

Lewis John Mcgibbney Mon, 14 Dec 2015 10:39:05 -0800

Hi Jon,

On Mon, Dec 14, 2015 at 10:22 AM, <[email protected]> wrote:


>
> I need to harvest blog posts and news articles and extract their date, the
> author, the text, the title and the comments if possible. The way I see it
> I have two choices, deploy a Nutch crawler or as a friend suggested, use
> Webhose.io.
>
> The Webhose.io site has its own Build or Buy
> <https://webhose.io/white-papers/build-or-buy> comparison, but I wanted to
> hear a Nutch user's take on it.
>
> Why did you go with Nutch and not with a service like Webhose.io? Where is
> the catch?
>
>
The link you've provided is, of course, not there to communicate an
unbiased opinion. It is marketing spiel. Regardless, the savings that
something like webhose.io claims it can offer are pretty appealing (I
would of course take these figures with a pinch of salt as well).
It's also important to understand that Scrapy is not a web crawler as
stated by the webhose.io article. Scrapy is an HTML scraper. It's in the
name. It maintains no web database (I use this term loosely to refer to a
record of the URIs you've crawled); Nutch, on the other hand, does.
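
To make the "web database" idea concrete, here is a minimal sketch in
Python of a record of crawled URIs. It is only an illustration of the
concept: the WebDb class and its method names are hypothetical, and this
is not how Nutch actually stores its crawl state.

```python
class WebDb:
    """Toy record of every URL seen so far and whether it has been fetched.

    Hypothetical illustration only; not Nutch's real data model.
    """

    def __init__(self):
        # Maps URL -> "unfetched" or "fetched". Dicts keep insertion order.
        self._records = {}

    def inject(self, urls):
        """Add URLs that have not been seen before; return how many were new."""
        added = 0
        for url in urls:
            if url not in self._records:
                self._records[url] = "unfetched"
                added += 1
        return added

    def next_batch(self, limit):
        """Return up to `limit` URLs that still need to be fetched."""
        pending = [u for u, s in self._records.items() if s == "unfetched"]
        return pending[:limit]

    def mark_fetched(self, url, outlinks=()):
        """Record a completed fetch and queue any newly discovered links."""
        self._records[url] = "fetched"
        self.inject(outlinks)


db = WebDb()
db.inject(["http://a.example/", "http://b.example/"])
db.mark_fetched("http://a.example/", ["http://a.example/", "http://c.example/"])
print(db.next_batch(10))
```

Because the record remembers what has already been fetched, the link
back to http://a.example/ is not queued a second time.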

On a different note, I have had experience using GNIP [0]. We had a number
of issues with this product, chiefly the following. We were targeting
Twitter content by hashtag. The hashtag BYOB (Be Your Own Boss) can also
mean 'Bring Your Own Booze', depending on whether it is Friday night or
Wednesday afternoon. The lack of functionality to differentiate between
the two meant that a more intelligent rule-based approach was required to
correctly target the content we wanted.
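
As a rough sketch of what a rule-based approach to the BYOB problem could
look like, the Python below classifies a post by the words that appear
alongside the hashtag. The keyword lists and the classify_byob function
are assumptions made up for illustration; they are not what we ran and
not part of GNIP.

```python
import re

# Hypothetical context words; a real rule set would need to be far larger.
BUSINESS_TERMS = {"startup", "entrepreneur", "freelance", "business", "boss"}
PARTY_TERMS = {"party", "beer", "wine", "booze", "drinks"}


def classify_byob(text):
    """Guess which sense of #BYOB a post uses from its surrounding words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if "byob" not in words:
        return "not_applicable"
    business_hits = len(words & BUSINESS_TERMS)
    party_hits = len(words & PARTY_TERMS)
    if business_hits > party_hits:
        return "business"
    if party_hits > business_hits:
        return "party"
    # No context words, or an equal number from each list.
    return "ambiguous"


print(classify_byob("Quit my job to launch a startup #BYOB"))
print(classify_byob("House party tonight, #BYOB and bring wine"))
```

Posts with no helpful context come back as "ambiguous", which is where
extra signals such as the time of posting would have to come in.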

If you do not know Nutch at all then there is no doubt that there is a
learning curve. If, on the other hand, you have someone else do all of
the data acquisition for you (e.g. webhose.io), then you will never know
how they got the content. If you want to reproduce the exercise then you
are reliant upon them. If you are able to do it in-house with Nutch then
it is yours: you own it and you can adapt it however you like.

Lewis

[0] https://www.gnip.com/
