Thanks Lewis. It's always the tension between control and price. I see what you are saying, and I will dive deeper into both solutions to see what the costs are (in time and money) before I decide.
Thanks,
Jon

On Mon, Dec 14, 2015 at 8:36 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Jon,
>
> On Mon, Dec 14, 2015 at 10:22 AM, <[email protected]> wrote:
>
> > I need to harvest blog posts and news articles and extract their date, the
> > author, the text, the title and the comments if possible. The way I see it
> > I have two choices: deploy a Nutch crawler or, as a friend suggested, use
> > Webhose.io.
> >
> > The Webhose.io site has its own Build or Buy
> > <https://webhose.io/white-papers/build-or-buy> comparison, but I wanted to
> > hear a Nutch user's take on it.
> >
> > Why did you go with Nutch and not with a service like Webhose.io? Where is
> > the catch?
>
> The link you've provided is of course not made available to communicate an
> unbiased opinion. It is marketing spiel. Regardless, the figures that
> something like webhose.io can potentially save you are pretty appealing (I
> would of course take these with a pinch of salt as well).
>
> Finally, it's important to understand that Scrapy is not a web crawler as
> stated by the webhose.io article. Scrapy is an HTML scraper. It's in the
> name. It maintains no web database (I use this term loosely to refer to a
> record of the URIs you've crawled); Nutch, on the other hand, does.
>
> On a different note, I have had experience using GNIP [0]. We had a number
> of issues with this product, namely the following. We were targeting Twitter
> content, specifically hashtags. The hashtag BYOB (Be Your Own Boss)
> can also mean 'Bring Your Own Booze' depending on whether it was Friday
> night or Wednesday afternoon. The lack of functionality to differentiate
> meant that a more intelligent rule-based approach was required to correctly
> target the content you want.
>
> If you do not know Nutch at all then there is no doubt that there is a
> learning curve. If on the other hand you wish to have someone else do all
> of the data acquisition for you, e.g. webhose.io, then it means you will
> never know how they got the content. If you want to reproduce the exercise
> then you are reliant upon them. If you are able to do it in house with
> Nutch then it is yours, you own it and you can adapt it however you like.
>
> Lewis
>
> [0] https://www.gnip.com/
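
For anyone reading along, here is a rough sketch of what the "web database" mentioned in the quoted reply could look like in its simplest form: a record of the URIs seen so far, so that nothing is fetched twice. This is only an illustration and not how Nutch stores its crawl state. The class name CrawlRecord, its methods, the status strings and the example.com URLs are all made up for the example.

```python
from dataclasses import dataclass, field
from urllib.parse import urldefrag


@dataclass
class CrawlRecord:
    """Hypothetical in-memory record of URIs seen so far (not Nutch's storage)."""

    # Maps each URI to "unfetched" or "fetched".
    status: dict = field(default_factory=dict)

    def add(self, uri: str) -> bool:
        """Register a URI; return True only if it was not known before."""
        uri, _ = urldefrag(uri)  # drop "#fragment" so duplicates collapse
        if uri in self.status:
            return False
        self.status[uri] = "unfetched"
        return True

    def mark_fetched(self, uri: str) -> None:
        """Note that the page behind this URI has been downloaded."""
        uri, _ = urldefrag(uri)
        self.status[uri] = "fetched"

    def pending(self) -> list:
        """Return the URIs still waiting to be fetched, in insertion order."""
        return [u for u, s in self.status.items() if s == "unfetched"]


record = CrawlRecord()
record.add("http://example.com/post#comments")
record.add("http://example.com/post")  # same page, ignored as a duplicate
record.add("http://example.com/other")
record.mark_fetched("http://example.com/post")
print(record.pending())
```

A scraper on its own only parses the page it is handed. Keeping some record like this across runs is the part that lets a crawl be resumed or repeated later.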

