I’m responding inline on the points where I’ve done something similar to what you need.
On Jul 2, 2014, at 12:34 PM, Daniel Sachse <[email protected]> wrote:

> Hey guys,
>
> I am working on a new SAAS product regarding website metrics.
> There are some basic things, I need to achieve and I ask myself, how
> easy/tough it is, to achieve this with Nutch:
>
> - Once we add a new customer domain, the crawler needs to crawl it until
> there are no new links (don't crawl links from another domain, subdomains
> should be ok)
> - We need to be able to trigger crawls for specific domains only
> (recrawling for a specific customer)

This can be accomplished fairly easily through the Nutch configuration and the default plugins (the URL filters, for instance). You’ll need to interact with that configuration, but I don’t think it will be a problem.

> - We need to evaluate metrics like XPath expressions or selectors similar
> to jQuery selectors and store them along the raw content
> - We need to archive the content of each HTML Page -> If we add new
> metrics, we want to evaluate them to previous versions of the page

You’ll need to write one or more parse plugins to accomplish these two items. It shouldn’t be too hard to come up with a basic implementation that lets you add your custom parse logic, where “parse” means extracting the parts of the web page that you need. The source code of the default plugins shows how a plugin must be implemented. My advice is to check out the plugins that are reasonably similar to your goals and start from there.

> - We need to trigger an aggregation job after fetching and analysing of
> individual pages has finished.

I don’t understand what you need to accomplish here. What will this aggregation job do?

>
> I think these were the most important parts. What do you guys think? Is
> this doable?
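To give an idea of the custom parse logic I mean, here is a minimal sketch in plain Java. It is not a Nutch plugin: the class name XPathMetricSketch, the evaluate method and the metric names are made up for illustration, and it assumes the page is well-formed XHTML so that the JDK's XML parser can read it. Inside a real parse plugin you would, as far as I recall, run the same kind of XPath evaluation against the DOM that Nutch passes to the plugin and store the results next to the raw content.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: evaluates named XPath "metrics" on a page.
public class XPathMetricSketch {

    // Returns metric name -> string value of the XPath expression for the given page.
    // Assumption: the input is well-formed XHTML; real-world HTML needs a tolerant parser.
    public static Map<String, String> evaluate(String xhtml, Map<String, String> metrics)
            throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, String> metric : metrics.entrySet()) {
            // evaluate(expression, node) returns the result converted to a string
            results.put(metric.getKey(), xpath.evaluate(metric.getValue(), doc));
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Example</title></head>"
                + "<body><h1>Hello</h1><a href=\"/a\">a</a><a href=\"/b\">b</a></body></html>";

        // Metric definitions would come from your own customer configuration.
        Map<String, String> metrics = new LinkedHashMap<>();
        metrics.put("title", "/html/head/title");
        metrics.put("heading", "//h1");
        metrics.put("linkCount", "count(//a)");

        System.out.println(evaluate(page, metrics));
    }
}
```

Because the metrics are just a map of names to expressions, adding a new metric later only means adding an entry and re-running the evaluation over the archived copies of each page.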
>
> With kind regards,
>
> Daniel

Hope this helps,
Greetings,

> --
>
> Wombat Software Technologies UG (haftungsbeschränkt)
> Im MediaPark 5
> D-50670 Köln
>
> Managing directors: Daniel Sachse, Jacob Pawlik
> Registered office: Köln
> Commercial register at the district court: Köln
> Commercial register number: HRB 79316
>
> Web: http://www.wombatsoftware.de
> Email: [email protected]
> Tel.: 0221/16905638
> Mobile: 01578/4922886

