If you only want to process logs, you don't need Nutch. Since you want to store them in Hadoop, there are several log processors with a Hadoop backend. Cloudera has one whose name I have forgotten, but here is another one, Chukwa: http://incubator.apache.org/chukwa/docs/r0.3.0/design.html
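To give a rough idea of what the "processing on those logs" step might look like as a map/reduce job, here is a minimal sketch in Python that counts page requests per URL. It is only an illustration, not how Chukwa or any other tool does it. The tab-separated log format (timestamp, URL, referrer) and the function names map_line, reduce_sorted and count_page_views are assumptions made up for this example, and the in-memory sort merely stands in for Hadoop's shuffle/sort phase.

```python
from itertools import groupby


def map_line(line):
    """Map step: emit (url, 1) for one log line.

    Assumed (hypothetical) format: 'timestamp<TAB>url<TAB>referrer'.
    Lines without a URL field are skipped.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2 and fields[1]:
        yield fields[1], 1


def reduce_sorted(pairs):
    """Reduce step: sum the counts for each URL.

    Expects the pairs sorted by URL, as they would be after a shuffle/sort.
    """
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        yield url, sum(count for _, count in group)


def count_page_views(lines):
    """Run map, sort and reduce in memory for a small batch of log lines."""
    mapped = [kv for line in lines for kv in map_line(line)]
    mapped.sort(key=lambda kv: kv[0])  # stands in for Hadoop's shuffle/sort
    return dict(reduce_sorted(mapped))
```

The same two functions could be wrapped in small scripts that read stdin and write tab-separated output, which is the shape a streaming-style job would expect. The resulting list of distinct URLs would also be a natural input for the later fetch and index step.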

On Mon, Apr 30, 2012 at 8:36 AM, Alex McLintock <[email protected]> wrote:

> Hi Folks,
>
> This is not 100% a Nutch question... and I hate it when other people say "I
> know my question is off topic....." so why I am doing it myself I don't
> know.
>
> I am looking at building a system similar to Google Analytics - in that it
> logs page requests on third party sites using some kind of Javascript, does
> processing on those logs, and produces reports. I see there are open source
> tools for this which are MySQL/RDBMS backed - but I want a Hadoop backed
> system for scalability. Do I just need to implement it myself or is anyone
> working on such a thing?
>
> To bring this back to Nutch I would also like to fetch and index all the
> pages which are logged in this way so that my system knows what they are
> about. (But I don't really need any web crawling after that)
>
> Any ideas?
>
> Cheers

