Hi Daniel, see inline.

-----Original message-----
> From: Daniel Sachse <[email protected]>
> Sent: Wednesday 2nd July 2014 18:35
> To: [email protected]
> Subject: Feasibility questions regarding my new project
>
> Hey guys,
>
> I am working on a new SaaS product regarding website metrics.
> There are some basic things I need to achieve, and I ask myself how
> easy/tough it is to achieve this with Nutch:
>
> - Once we add a new customer domain, the crawler needs to crawl it until
> there are no new links (don't crawl links from another domain, subdomains
> should be ok)

That's easy: just let Nutch run continuously. It will stop when there is nothing left to fetch and recrawl pages when they are due. You had better hope your sites do not contain spider traps, or it will continue almost forever. You can mitigate them with manual regex filters or with special software that detects them.

> - We need to be able to trigger crawls for specific domains only
> (recrawling for a specific customer)

You cannot trigger a recrawl for a specific host out of the box. It would need a modification of Nutch that resets the fetch times of all URLs matching a pattern or host/domain. That should be feasible to build.

> - We need to evaluate metrics like XPath expressions or selectors similar
> to jQuery selectors and store them along the raw content

Nutch does not have that on board yet, but there are patches that can extract data via XPath.

> - We need to archive the content of each HTML Page -> If we add new
> metrics, we want to evaluate them to previous versions of the page

You can do that with Nutch 1.x; it keeps all crawled data stored unless you remove it. That allows you to run complex analyses over all pages and their history. If the crawl is large, you do need a powerful Hadoop cluster to do the math.

> - We need to trigger an aggregation job after fetching and analysing of
> individual pages has finished

Well, normally one would index the pages to some back end, but you can aggregate the data and send it anywhere you want. Nutch 1.x now has pluggable indexing back ends, so it should fulfill your needs.

> I think these were the most important parts. What do you guys think? Is
> this doable?

Sure :)

> With kind regards,
>
> Daniel
>
> --
>
> Wombat Software Technologies UG (haftungsbeschränkt)
> Im MediaPark 5
> D-50670 Köln
>
> Geschäftsführer: Daniel Sachse, Jacob Pawlik
> Unternehmenssitz: Köln
> Handelsregister beim Amtsgericht: Köln
> Handelsregister-Nummer: HRB 79316
>
> Web: http://www.wombatsoftware.de
> Email: [email protected]
> Tel.: 0221/16905638
> Mobil: 01578/4922886
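
To make the domain scoping and the spider-trap point a bit more concrete, here is a minimal sketch in Python. It is not Nutch code; in Nutch you would express the same rules in its URL filter configuration. The function names and the example.com domain are made up for illustration, and the trap pattern only catches one common case (a path segment that keeps repeating).

```python
import re
from urllib.parse import urlsplit

# Matches URLs in which a slash-delimited path segment repeats three or
# more times (e.g. /a/b/a/c/a/), a typical sign of a crawler loop.
TRAP = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def in_scope(url, customer_domain):
    """True if the URL's host is the customer domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = customer_domain.lower()
    return host == domain or host.endswith("." + domain)


def should_crawl(url, customer_domain):
    """Hypothetical helper: accept in-scope URLs that do not look like a trap."""
    return in_scope(url, customer_domain) and not TRAP.match(url)
```

With these rules, `http://shop.example.com/products` is accepted for the customer domain `example.com`, while `http://example.com.evil.org/` and `http://example.com/a/b/a/c/a/` are rejected.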

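
The idea of evaluating newly added metrics against previously archived versions of a page can be sketched as follows. This is only an illustration in Python using the standard library, not one of the Nutch XPath patches. It assumes the archived markup is well-formed (real-world HTML would need a lenient parser) and uses only the limited XPath subset that `xml.etree.ElementTree` supports. The archive contents, dates, and metric names are invented sample data.

```python
import xml.etree.ElementTree as ET


def evaluate_metrics(archive, metrics):
    """Count the matches of every XPath metric in every archived page version.

    archive: dict mapping a fetch date to the stored markup of the page
    metrics: dict mapping a metric name to an XPath expression
    """
    results = {}
    for fetched_at, markup in sorted(archive.items()):
        root = ET.fromstring(markup)  # assumes well-formed markup
        results[fetched_at] = {
            name: len(root.findall(path)) for name, path in metrics.items()
        }
    return results


# Invented sample data: two stored versions of the same page.
archive = {
    "2014-07-01": "<html><body><h1>Home</h1><a href='/a'>A</a></body></html>",
    "2014-07-02": "<html><body><h1>Home</h1><a href='/a'>A</a>"
                  "<a href='/b'>B</a><img src='x.png'/></body></html>",
}

# A metric added later is simply another entry here; rerunning the function
# evaluates it against every stored version.
metrics = {"h1_count": ".//h1", "link_count": ".//a[@href]", "image_count": ".//img"}

history = evaluate_metrics(archive, metrics)
```

Because the raw content is kept, adding a metric later only means rerunning the evaluation over the stored versions; no refetch is needed.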
