RE: Deploy a Nutch crawler or use Webhose.io?

Markus Jelsma Tue, 15 Dec 2015 11:36:06 -0800

Hello - this is not very straightforward to implement, even for a single site, 
but doable. If there are multiple sites to extract data from, you can use a 
party like webhose or any other company that provides scraping services.


Re: Scrapy. It is indeed not as robust at scale as Nutch, but it is quite 
usable for very specific extraction tasks. You do need to program XPaths 
and the like for every different page you want to extract data from.
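
To give an idea of what "XPaths for every different page" means in practice, 
here is a minimal sketch. It uses Python's standard-library ElementTree (with 
its limited XPath subset) rather than Scrapy itself, and the site names, 
XPath expressions and sample pages are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: every site layout needs its own XPaths.
SITE_RULES = {
    "blog-a": {
        "title": ".//h1[@class='post-title']",
        "date": ".//span[@class='date']",
    },
    "news-b": {
        "title": ".//div[@id='headline']/h2",
        "date": ".//time",
    },
}


def extract(site, xhtml):
    """Apply the XPath rules registered for `site` to a well-formed page."""
    root = ET.fromstring(xhtml)
    fields = {}
    for name, xpath in SITE_RULES[site].items():
        node = root.find(xpath)
        # A rule that does not match this layout yields None.
        fields[name] = "".join(node.itertext()).strip() if node is not None else None
    return fields


# Made-up sample pages, one per layout.
PAGE_A = (
    "<html><body><h1 class='post-title'>Hello Nutch</h1>"
    "<span class='date'>2015-12-14</span></body></html>"
)
PAGE_B = (
    "<html><body><div id='headline'><h2>Crawler news</h2></div>"
    "<time>2015-12-15</time></body></html>"
)

print(extract("blog-a", PAGE_A))
print(extract("news-b", PAGE_A))  # wrong rules for this layout: nothing matches
```

Applying the rules of one site to a page from another finds nothing, which 
is why each new layout means writing new rules.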

We have used Scrapy in the past for getting data from many sites, but it was 
very labour intensive. So we usually prefer an algorithm that works for most 
sites, extracting article dates and content. But it is not perfect, only much 
less labour intensive.
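
As a rough illustration of the site-independent idea only (this is not the 
algorithm referred to above, which is not described in this message), the 
sketch below guesses a date with a single regular expression and keeps only 
the longer paragraphs as content. The ISO date pattern, the five-word 
threshold and all names are assumptions chosen for the example.

```python
import re
from html.parser import HTMLParser

# Assumption: dates appear in ISO form (YYYY-MM-DD) somewhere in the page text.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


class BlockCollector(HTMLParser):
    """Collects all page text, plus the text of each <p> element separately."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.blocks = []
        self.all_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        self.all_text.append(data)
        if self.in_p:
            self.blocks[-1] += data


def guess_article(html, min_words=5):
    """Guess date and content without any site-specific rules."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    match = DATE_RE.search(" ".join(parser.all_text))
    # Short paragraphs are assumed to be menus, bylines and similar noise.
    content = [b.strip() for b in parser.blocks if len(b.split()) >= min_words]
    return {
        "date": match.group(0) if match else None,
        "content": "\n".join(content),
    }


SAMPLE = (
    "<html><body><div>Posted 2015-12-14 by Jon</div><p>Menu</p>"
    "<p>This is the first long paragraph of the article.</p></body></html>"
)
print(guess_article(SAMPLE))
```

A heuristic like this needs no per-site work, but it will pick the wrong date 
or drop short paragraphs on some pages, which is the trade-off described 
above.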

M.
 
-----Original message-----
> From:Lewis John Mcgibbney <[email protected]>
> Sent: Monday 14th December 2015 19:36
> To: [email protected]
> Subject: Re: Deploy a Nutch crawler or use Webhose.io?
> 
> Hi Jon,
> 
> On Mon, Dec 14, 2015 at 10:22 AM, <[email protected]> wrote:
> 
> >
> > I need to harvest blog posts and news articles and extract their date, the
> > author, the text, the title and the comments if possible. The way I see it
> > I have two choices, deploy a Nutch crawler or as a friend suggested, use
> > Webhose.io.
> >
> > The Webhose.io site has its own Build or Buy
> > <https://webhose.io/white-papers/build-or-buy> comparison, but I wanted to
> > hear a Nutch user's take on it.
> >
> > Why did you go with Nutch and not with a service like Webhose.io? Where is
> > the catch?
> >
> >
> The link you've provided is of course not made available to communicate an
> unbiased opinion. It is marketing spiel. Regardless, the figures that
> something like webhose.io can potentially save you are pretty appealing (I
> would of course take these with a pinch of salt as well).
>  Finally, it's important to understand that Scrapy is not a web crawler as
> stated by the webhose.io article. Scrapy is an HTML scraper. It's in the
> name. It maintains no web database (I use this term loosely to refer to a
> record of the URIs you've crawled); Nutch, on the other hand, does.
> 
> On a different note, I have had experience using GNIP [0]. We had a number
> of issues with this product, namely the following. We were targeting Twitter
> content. We were targeting hashtags. The hashtag BYOB (Be Your Own Boss)
> can also mean 'Bring Your Own Booze' depending on whether it is Friday
> night or Wednesday afternoon. The lack of functionality to differentiate
> meant that a more intelligent rule-based approach was required to correctly
> target the content you want.
> 
> If you do not know Nutch at all then there is no doubt that there is a
> learning curve. If on the other hand you wish to have someone else do all
> of the data acquisition for you, e.g. webhose.io, then it means you will
> never know how they got the content. If you want to reproduce the exercise
> then you are reliant upon them. If you are able to do it in house with
> Nutch then it is yours, you own it and you can adapt it however you like.
> Lewis
> 
> [0] https://www.gnip.com/
>
