I’m responding inline on the points where I’ve done something similar to what
you need.

On Jul 2, 2014, at 12:34 PM, Daniel Sachse <[email protected]> wrote:

> Hey guys,
> 
> I am working on a new SaaS product regarding website metrics.
> There are some basic things I need to achieve, and I ask myself how
> easy/tough it is to achieve this with Nutch:
> 
> - Once we add a new customer domain, the crawler needs to crawl it until
> there are no new links (don't crawl links from another domain, subdomains
> should be ok)
> - We need to be able to trigger crawls for specific domains only
> (recrawling for a specific customer)

Both of these can be accomplished through Nutch configuration with the default
plugins. You’ll need to interact with this configuration (URL filters and seed
lists per customer), but I don’t think that would be a problem.
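
To make the "same domain, subdomains allowed" rule concrete, here is a minimal
Java sketch of the check. In Nutch itself I would expect this to live in
configuration (for example the rules file read by the regex URL filter plugin)
rather than in code; the class and method names below are hypothetical and are
not part of Nutch.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not a Nutch class: decides whether a URL belongs to a
// customer domain or to one of its subdomains.
public class DomainScope {

    public static boolean inScope(String url, String customerDomain) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL, so never crawl it
        }
        if (host == null) {
            return false; // relative or opaque URI without a host
        }
        host = host.toLowerCase(Locale.ROOT);
        String domain = customerDomain.toLowerCase(Locale.ROOT);
        // Accept the domain itself, or any host ending in ".<domain>".
        // The leading dot keeps "notexample.com" from matching "example.com".
        return host.equals(domain) || host.endsWith("." + domain);
    }
}
```

The same rule, applied only to the seed URLs of one customer, is also what
would restrict a recrawl to that customer.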

> - We need to evaluate metrics like XPath expressions or selectors similar
> to jQuery selectors and store them along the raw content
> - We need to archive the content of each HTML Page -> If we add new
> metrics, we want to evaluate them to previous versions of the page

You’ll need to write one or several parse plugins to accomplish these two
items. It wouldn’t be too hard to come up with a basic implementation that lets
you add your custom parse logic (parsing here meaning extracting the parts of
the webpage that you need). This is fairly easy, and you can use the source
code of the plugins shipped by default to see how a plugin must be implemented.
My advice is to start from the plugins that are reasonably similar to what you
are trying to do.
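
As an illustration of the extraction step only, here is a minimal sketch that
evaluates a set of named XPath expressions against a page using the standard
Java XML APIs. It assumes the page is already well-formed XHTML; real-world
HTML usually has to go through a tolerant HTML parser first, and inside a
Nutch parse plugin you would work on the DOM that Nutch hands you instead of
re-parsing the text. The class name and the metric names are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: runs named XPath expressions
// against a well-formed XHTML string and returns the results as strings.
public class XPathMetrics {

    public static Map<String, String> evaluate(String xhtml,
                                               Map<String, String> expressions) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : expressions.entrySet()) {
                // Metric name -> string value of the XPath result.
                results.put(e.getKey(), xpath.evaluate(e.getValue(), doc));
            }
            return results;
        } catch (Exception ex) {
            // Malformed input or an invalid expression.
            throw new IllegalArgumentException("Could not evaluate metrics", ex);
        }
    }
}
```

Because the function takes the page text as input, the same expressions (or
new ones added later) can be run again over archived copies of a page, which
is what the second item asks for.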

> - We need to trigger an aggregation job after fetching and analysing of
> individual pages has finished.

I don’t understand what you need to accomplish here. What would this
aggregation job do?

> 
> I think these were the most important parts. What do you guys think? Is
> this doable?
> 
> With kind regards,
> 
> Daniel

Hope this helps,

Greetings,

> 
> --
> 
> Wombat Software Technologies UG (haftungsbeschränkt)
> Im MediaPark 5
> D-50670 Köln
> 
> Geschäftsführer: Daniel Sachse, Jacob Pawlik
> Unternehmenssitz: Köln
> Handelsregister beim Amtsgericht: Köln
> Handelsregister-Nummer: HRB 79316
> 
> Web: http://www.wombatsoftware.de
> Email: [email protected]
> Tel.: 0221/16905638
> Mobil: 01578/4922886

