Thanks Lewis. It's always the tension between control and price. I see what
you are saying, and I will dive deeper into both solutions to see what the
costs are (in time and money) before I decide.

Thanks,

Jon

On Mon, Dec 14, 2015 at 8:36 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Jon,
>
> On Mon, Dec 14, 2015 at 10:22 AM, <[email protected]>
> wrote:
>
> >
> > I need to harvest blog posts and news articles and extract their date,
> > the author, the text, the title and the comments if possible. The way I
> > see it, I have two choices: deploy a Nutch crawler or, as a friend
> > suggested, use Webhose.io.
> >
> > The Webhose.io site has its own Build or Buy
> > <https://webhose.io/white-papers/build-or-buy> comparison, but I wanted
> > to hear a Nutch user's take on it.
> >
> > Why did you go with Nutch and not with a service like Webhose.io? Where
> > is the catch?
> >
> >
> The page you've linked is of course not there to communicate an unbiased
> opinion. It is marketing spiel. Regardless, the savings that something
> like webhose.io can potentially offer are pretty appealing (I would of
> course take these figures with a pinch of salt as well).
> Finally, it's important to understand that Scrapy is not a web crawler as
> stated by the webhose.io article. Scrapy is an HTML scraper. It's in the
> name. It maintains no web database (I use this term loosely to refer to a
> record of the URIs you've crawled); Nutch, on the other hand, does.
>
> On a different note, I have had experience using GNIP [0]. We had a number
> of issues with this product, chiefly the following. We were targeting
> Twitter content, specifically hashtags. The hashtag BYOB (Be Your Own Boss)
> can also mean 'Bring Your Own Booze', depending on whether it is Friday
> night or Wednesday afternoon. The lack of functionality to differentiate
> meant that a more intelligent rule-based approach was required to correctly
> target the content we wanted.
>
> If you do not know Nutch at all then there is no doubt that there is a
> learning curve. If, on the other hand, you wish to have someone else do all
> of the data acquisition for you (e.g. webhose.io), then you will never
> know how they got the content. If you want to reproduce the exercise
> then you are reliant upon them. If you are able to do it in house with
> Nutch then it is yours: you own it and you can adapt it however you like.
>
> Lewis
>
> [0] https://www.gnip.com/
>
