RE: Feasibility questions regarding my new project

Markus Jelsma Wed, 02 Jul 2014 13:51:25 -0700

Hi Daniel, see inline 
 
-----Original message-----
> From:Daniel Sachse <[email protected]>
> Sent: Wednesday 2nd July 2014 18:35
> To: [email protected]
> Subject: Feasibility questions regarding my new project
> 
> Hey guys,
> 
> I am working on a new SaaS product for website metrics.
> There are some basic things I need to achieve, and I am wondering how
> easy or hard it is to achieve them with Nutch:
> 
> - Once we add a new customer domain, the crawler needs to crawl it until
> there are no new links (don't crawl links from another domain; subdomains
> should be OK)


That's easy: just let Nutch run continuously. It will stop once there is 
nothing left to fetch and recrawl when it is time to do so. You had better hope 
your sites do not contain spider traps, or it will continue almost forever. You 
can mitigate them using manual regex filters or some special software that 
detects them.
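
To give an idea of what such a filter does, here is a rough sketch in plain 
Python, not in Nutch's own filter configuration. The function names, the 
length limit, and the trap pattern are all assumptions for illustration. It 
keeps URLs on the customer domain or its subdomains and rejects URLs whose 
path keeps repeating the same segment.

```python
import re
from urllib.parse import urlsplit

MAX_URL_LENGTH = 512  # assumed cut-off; tune per site

# Assumed trap pattern: the same path segment appears three times with one
# other segment in between, e.g. /a/b/a/b/a/
REPEATING_SEGMENT = re.compile(r"(/[^/]+)/[^/]+\1/[^/]+\1/")


def in_scope(url, customer_domain):
    """True if the URL's host is the customer domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = customer_domain.lower()
    return host == domain or host.endswith("." + domain)


def looks_like_trap(url):
    """Crude heuristics: overlong URLs and paths with repeating segments."""
    if len(url) > MAX_URL_LENGTH:
        return True
    return REPEATING_SEGMENT.search(urlsplit(url).path) is not None


def accept(url, customer_domain):
    """Keep a URL only if it is in scope and does not look like a trap."""
    return in_scope(url, customer_domain) and not looks_like_trap(url)
```

Heuristics like these will miss some traps and may reject a few legitimate 
URLs, so they need tuning against the actual sites.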

> - We need to be able to trigger crawls for specific domains only
> (recrawling for a specific customer)

You cannot trigger a recrawl for a specific host. That would need a 
modification of Nutch to reset the fetchTimes for all URLs matching a pattern 
or host/domain. It should be feasible to build.
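
The idea behind such a modification, sketched in plain Python under the 
assumption that the crawl state is just a mapping from URL to next fetch time. 
This is not Nutch's actual data model or API, and `reset_fetch_times` is a 
hypothetical name.

```python
from urllib.parse import urlsplit


def reset_fetch_times(fetch_times, host, now):
    """Return a copy of {url: next_fetch_time} in which every URL on `host`
    (or one of its subdomains) is due no later than `now`."""
    host = host.lower()
    updated = {}
    for url, due in fetch_times.items():
        url_host = (urlsplit(url).hostname or "").lower()
        matches = url_host == host or url_host.endswith("." + host)
        # Only pull the fetch time forward; never delay a URL that is already due.
        updated[url] = min(due, now) if matches else due
    return updated
```

After a reset like this, the next crawl cycle would pick up the customer's 
URLs as due, while all other hosts keep their schedule.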

> - We need to evaluate metrics like XPath expressions or selectors similar
> to jQuery selectors and store them alongside the raw content

Nutch does not have that built in yet, but there are patches that can extract 
data via XPath.
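
For illustration only, this is roughly what evaluating XPath-style metrics 
against a stored page looks like, using Python's standard library instead of 
any Nutch patch. `evaluate_metrics` and the metric names are made up, 
ElementTree supports only a subset of XPath, and it needs well-formed markup, 
so real-world HTML would first have to go through a tolerant parser.

```python
import xml.etree.ElementTree as ET


def evaluate_metrics(xhtml, metrics):
    """Evaluate each named path against the page and return the text of
    every matching element, keyed by metric name."""
    root = ET.fromstring(xhtml)
    results = {}
    for name, path in metrics.items():
        results[name] = [
            "".join(element.itertext()).strip()
            for element in root.findall(path)
        ]
    return results


page = "<html><body><h1>Title</h1><a href='/a'>A</a><a>B</a></body></html>"
metrics = {"headings": ".//h1", "links": ".//a[@href]"}
result = evaluate_metrics(page, metrics)
```

Because the metrics are just data, new ones can be added later and evaluated 
against older stored versions of a page.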

> - We need to archive the content of each HTML page -> If we add new
> metrics, we want to evaluate them against previous versions of the page

You can do that with Nutch 1.x; it keeps all data stored (unless removed). That 
would allow you to do complex analysis on all pages and their history. If the 
crawl is large, you do need a powerful Hadoop cluster to do the math.

> - We need to trigger an aggregation job after the fetching and analysis of
> individual pages has finished

Well, normally one would index them to some back end, but you can aggregate the 
data and send it anywhere you want. Nutch 1.x now has pluggable indexing 
back ends, so it should fulfill your needs.

> 
> I think these were the most important parts. What do you guys think? Is
> this doable?

Sure :)

> 
> With kind regards,
> 
> Daniel
> 
> --
> 
> Wombat Software Technologies UG (haftungsbeschränkt)
> Im MediaPark 5
> D-50670 Köln
> 
> Geschäftsführer: Daniel Sachse, Jacob Pawlik
> Unternehmenssitz: Köln
> Handelsregister beim Amtsgericht: Köln
> Handelsregister-Nummer: HRB 79316
> 
> Web: http://www.wombatsoftware.de
> Email: [email protected]
> Tel.: 0221/16905638
> Mobil: 01578/4922886
>
