On Sun, Aug 9, 2009 at 8:20 AM, Julian Moritz <[email protected]> wrote:
> Hi there,
>
> I am very new to CouchDB but highly interested. I work for the NLP
> department of my university, and maybe CouchDB would be a good choice
> for search engine/web crawler storage.
>
> Is there a project which implements such a thing with CouchDB?
>
I first got into CouchDB using it as part of a web spider. I used Nutch /
Hadoop to run the actual crawl (with depth=1, so it was merely fetching all
the URLs in a long list I'd give it).

Then I'd use Hadoop to run a Ruby job over all the fetched pages, which
parsed the HTML / XML / mp3s etc., converting each page into a JSON document
and putting it in CouchDB. A stripped-down sketch of that step is below.

Then I used CouchDB map/reduce to find all the inlinks for each page and to
do various other kinds of analysis, as well as to find the list of URLs we'd
learned about in the last crawl but hadn't fetched yet, for driving the next
round of the crawl. A sketch of those views follows the first one.

You could do something like this a lot more simply with Disco and CouchDB, I
think, but you'd probably end up writing more of the code yourself.
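To make the Ruby step concrete, here's a minimal sketch, not the actual job:
it assumes the nokogiri gem for parsing and the couchrest gem as the CouchDB
client, and fetched_pages is a made-up stand-in for however you iterate the
crawl output (in my case this loop ran as a Hadoop task).

    # Sketch only: turn each fetched page into a JSON document in CouchDB.
    # nokogiri / couchrest / fetched_pages are illustrative choices, not
    # the ones from the real job.
    require 'nokogiri'
    require 'couchrest'

    db = CouchRest.database!("http://127.0.0.1:5984/crawl")

    def page_to_doc(url, html)
      page = Nokogiri::HTML(html)
      {
        "_id"      => url,    # one document per URL
        "type"     => "page",
        "title"    => (page.at_css("title") && page.at_css("title").text),
        # outlinks feed the inlink view and the next crawl round
        "outlinks" => page.css("a[href]").map { |a| a["href"] }.uniq
      }
    end

    fetched_pages.each do |url, html|
      db.save_doc(page_to_doc(url, html))
    end

Using the URL as the _id means looking up any page's document is a single
GET by id.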

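The map/reduce side is just JavaScript view functions in a design document.
Again a sketch rather than my real design doc (pushed with the same db
handle as above; the names "crawl", "inlinks" and "frontier" are made up):

    # Sketch of the design doc; the map/reduce bodies are plain JavaScript.
    db.save_doc({
      "_id"   => "_design/crawl",
      "views" => {
        # inlinks: emit every outlink, sum to get an inlink count per URL
        "inlinks" => {
          "map"    => "function(doc) {
                         if (doc.type == 'page' && doc.outlinks) {
                           doc.outlinks.forEach(function(u) { emit(u, 1); });
                         }
                       }",
          "reduce" => "function(keys, values, rereduce) { return sum(values); }"
        },
        # frontier: a fetched page emits 1 for itself and 0 for every URL
        # it merely mentions; a URL whose sum is 0 is still unfetched
        "frontier" => {
          "map"    => "function(doc) {
                         if (doc.type == 'page') {
                           emit(doc._id, 1);
                           (doc.outlinks || []).forEach(function(u) { emit(u, 0); });
                         }
                       }",
          "reduce" => "function(keys, values, rereduce) { return sum(values); }"
        }
      }
    })

Query inlinks with group=true for per-URL inlink counts. Query frontier with
group=true and keep the keys whose value is 0; that zero-valued list is the
URL file you hand back to Nutch for the next round.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io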