Hey guys, I am working on a new SaaS product for website metrics. There are a few basic things I need to achieve, and I am wondering how easy or hard it would be to achieve them with Nutch:
- Once we add a new customer domain, the crawler needs to crawl it until there are no new links (don't follow links to other domains; subdomains are OK)
- We need to be able to trigger crawls for specific domains only (recrawling for a specific customer)
- We need to evaluate metrics such as XPath expressions or selectors similar to jQuery selectors, and store the results along with the raw content
- We need to archive the content of each HTML page, so that when we add new metrics we can evaluate them against previous versions of the page
- We need to trigger an aggregation job once fetching and analysis of the individual pages has finished

I think these are the most important parts. What do you guys think? Is this doable?

With kind regards,
Daniel

--
Wombat Software Technologies UG (haftungsbeschränkt)
Im MediaPark 5
D-50670 Köln
Geschäftsführer: Daniel Sachse, Jacob Pawlik
Unternehmenssitz: Köln
Handelsregister beim Amtsgericht: Köln
Handelsregister-Nummer: HRB 79316
Web: http://www.wombatsoftware.de
Email: [email protected]
Tel.: 0221/16905638
Mobil: 01578/4922886
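
P.S. To make the first requirement more concrete, here is a rough sketch of the scope rule I have in mind ("same domain or any subdomain, nothing else"). This is not Nutch code, just plain Python to illustrate the rule; the function name `in_scope` and the domain `example.com` are made up for the example.

```python
from urllib.parse import urlsplit


def in_scope(url: str, customer_domain: str) -> bool:
    """Return True if url is on customer_domain or one of its subdomains."""
    # urlsplit lower-cases the hostname; it is None when the URL has no host
    host = urlsplit(url).hostname
    if host is None:
        return False
    domain = customer_domain.lower().rstrip(".")
    host = host.rstrip(".")
    # Exact match, or a subdomain separated by a dot (so "notexample.com" is out)
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    for link in (
        "http://example.com/pricing",
        "https://blog.example.com/post/1",
        "http://notexample.com/",
        "http://example.com.evil.org/",
    ):
        print(link, in_scope(link, "example.com"))
```

Links for which this returns False should simply not be followed. The question is whether Nutch can be configured to behave like this per customer domain.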

