Hi Jon,

On Mon, Dec 14, 2015 at 10:22 AM, <[email protected]> wrote:
>
> I need to harvest blog posts and news articles and extract their date, the
> author, the text, the title and the comments if possible. The way I see it
> I have two choices, deploy a Nutch crawler or as a friend suggested, use
> Webhose.io.
>
> The Webhose.io site has its own Build or Buy
> <https://webhose.io/white-papers/build-or-buy> comparison, but I wanted to
> hear a Nutch user's take on it.
>
> Why did you go with Nutch and not with a service like Webhose.io? Where is
> the catch?
>

The link you've provided is of course not intended to communicate an unbiased opinion. It is marketing spiel. Regardless, the savings that something like webhose.io can potentially offer are pretty appealing (I would of course take these figures with a pinch of salt as well).

It's also important to understand that Scrapy is not a web crawler as stated by the webhose.io article. Scrapy is an HTML scraper. It's in the name. It maintains no web database (I use this term loosely to refer to a record of the URIs you've crawled); Nutch, on the other hand, does.

On a different note, I have had experience using GNIP [0]. We had a number of issues with this product, namely the following. We were targeting Twitter content, specifically hashtags. The hashtag BYOB (Be Your Own Boss) can also mean 'Bring Your Own Booze' depending on whether it is Friday night or Wednesday afternoon. The lack of functionality to differentiate between the two meant that a more intelligent rule-based approach was required to correctly target the content we wanted.

If you do not know Nutch at all then there is no doubt that there is a learning curve. If on the other hand you wish to have someone else do all of the data acquisition for you, e.g. webhose.io, then you will never know how they got the content. If you want to reproduce the exercise then you are reliant upon them. If you are able to do it in house with Nutch then it is yours, you own it and you can adapt it however you like.
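To make the BYOB point above a little more concrete, here is a minimal Python sketch of what a rule-based disambiguation step could look like. It is only an illustration and not the setup that was used with GNIP: the cue word lists, the time windows and the `classify_byob` function are all hypothetical and would need tuning against real data.

```python
from datetime import datetime

# Hypothetical cue words; a real rule set would be tuned against labelled posts.
BUSINESS_CUES = {"startup", "entrepreneur", "freelance", "business", "boss"}
PARTY_CUES = {"party", "beer", "wine", "drinks", "booze", "bbq"}


def classify_byob(text: str, posted_at: datetime) -> str:
    """Return 'business', 'party' or 'unknown' for a post tagged #BYOB."""
    # Lowercase each token and strip common punctuation and the '#' prefix.
    words = {w.strip("#.,!?:;'\"").lower() for w in text.split()}
    business = len(words & BUSINESS_CUES)
    party = len(words & PARTY_CUES)

    # Rule 1: the text itself decides when one set of cues clearly dominates.
    if business != party:
        return "business" if business > party else "party"

    # Rule 2 (assumed tie-break): fall back on when the post was made.
    # datetime.weekday() returns Monday=0 ... Sunday=6.
    friday_or_saturday_evening = (
        posted_at.weekday() in (4, 5) and posted_at.hour >= 18
    )
    office_hours = posted_at.weekday() < 5 and 9 <= posted_at.hour < 17
    if friday_or_saturday_evening:
        return "party"
    if office_hours:
        return "business"
    return "unknown"
```

The point is not the specific rules but that you control them: when the content is acquired in house, a filter like this can be adjusted and re-run whenever the targeting turns out to be wrong.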
Lewis

[0] https://www.gnip.com/

