Hey guys,

I am working on a new SaaS product for website metrics.
There are a few basic things I need to achieve, and I am wondering how
easy or hard it would be to do them with Nutch:

- Once we add a new customer domain, the crawler needs to crawl it until
there are no new links (links to other domains must not be followed;
subdomains are fine)
- We need to be able to trigger crawls for specific domains only
(recrawling for a specific customer)
- We need to evaluate metrics defined as XPath expressions or jQuery-style
selectors and store the results alongside the raw content
- We need to archive the content of each HTML page, so that when we add
new metrics we can evaluate them against previous versions of the page
- We need to trigger an aggregation job once fetching and analysing the
individual pages has finished
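
The scoping rule in the first point above could be sketched roughly as
follows. This is plain Python for illustration only, not Nutch code or
configuration; the function name in_scope, the sample links, and the
domain example.com are all hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url: str, customer_domain: str) -> bool:
    """Return True if url is on customer_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    domain = customer_domain.lower().rstrip(".")
    # Exact match, or a subdomain: the host must end with ".<domain>",
    # so "notexample.com" is not mistaken for "example.com".
    return host == domain or host.endswith("." + domain)


# Hypothetical links extracted from a fetched page
links = [
    "http://example.com/pricing",
    "https://blog.example.com/post/1",
    "http://notexample.com/",
    "http://example.com.evil.org/",
]

# Only links on the customer domain or its subdomains are kept for crawling
to_crawl = [u for u in links if in_scope(u, "example.com")]
```

Comparing on the parsed host name rather than on the raw URL string
avoids accidentally accepting look-alike hosts that merely contain the
customer domain.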

I think those are the most important points. What do you guys think? Is
this doable?

With kind regards,

Daniel

--

Wombat Software Technologies UG (haftungsbeschränkt)
Im MediaPark 5
D-50670 Köln

Managing directors: Daniel Sachse, Jacob Pawlik
Registered office: Köln
Commercial register at the local court: Köln
Commercial register number: HRB 79316

Web: http://www.wombatsoftware.de
Email: [email protected]
Phone: 0221/16905638
Mobile: 01578/4922886
