Hi,

Chris Anderson wrote:
> On Sun, Aug 9, 2009 at 8:20 AM, Julian Moritz <[email protected]> wrote:
>> Hi there,
>>
>> I am very new to CouchDB but highly interested. I work for the
>> NLP department of my university, and maybe CouchDB would be a good
>> choice for a search engine / web crawler storage.
>>
>> Is there a project which implements such a thing on CouchDB?
>
> I first got into CouchDB using it as part of a web-spider. I used
> Nutch / Hadoop to run the actual crawl (with depth=1, so it was merely
> fetching all the URLs in a long list I'd give it).
>
> Then I'd use Hadoop to run a Ruby job over all the fetched pages,
> which parsed the HTML / XML / mp3 etc., converting each into a JSON
> document and putting it in CouchDB.
>
> Then I used CouchDB map/reduce to find all the inlinks for each page
> and to do various other kinds of analysis, as well as to find the
> list of URLs we had learned about in the last crawl but hadn't
> fetched yet, for driving the next round of the crawl.
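(An aside before my actual reply: the inlink analysis Chris mentions is a
nice example of how far a single view gets you. Here is a rough sketch of
how I imagine it, in Python over CouchDB's plain HTTP API; the page schema
with "url" and "links" fields is my guess, not Chris's actual one:)

    import json
    import urllib.request

    COUCH = "http://localhost:5984/crawl"   # database name is made up

    # One view: for every fetched page, emit one row per outgoing link,
    # keyed by the *target* URL. Querying the view with ?key="<url>"
    # then lists every page linking to <url>, i.e. its inlinks.
    design = {
        "views": {
            "inlinks": {
                "map": """
                    function (doc) {
                      if (doc.type === 'page' && doc.links) {
                        for (var i = 0; i < doc.links.length; i++) {
                          emit(doc.links[i], doc.url);
                        }
                      }
                    }
                """
            }
        }
    }

    req = urllib.request.Request(
        COUCH + "/_design/links",
        data=json.dumps(design).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)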
Okay, as I wrote, I'm studying natural language processing. My department
has some experience with crawling the web. Let's break it down:

1st: You need bandwidth. Crawling from a single machine, or even from a
single point in the network, is more or less useless; but a distributed
crawling application is not a problem if enough people run it.

2nd: You need even more storage. A highly (horizontally) scalable
database would be helpful.

Why CouchDB for the 1st point? The client software could be written in
any language, because data is sent to the storage as JSON over plain
HTTP (a minimal client sketch is at the end of this mail).

Why CouchDB for the 2nd point? Well, isn't CouchDB exactly what you need
there?

For fast crawling you need a list of URLs in random order: extract every
URL from every document and sort the list by a random key (done with a
view; also sketched below). This is _very_ important: fast crawling
without random ordering would amount to a DoS attack on some sites.
Enforcing uniqueness on the URL list would make it too slow once the
list gets big.

For fast searching you need a special data structure called a wordlist:
in each line the first column is a word, and the following columns are
the documents which contain that word. Something like:

house document_1 document_2
mouse document_2 document_3

This could be done with a simple view (sketched below as well).

So everyone could contribute some disk space for storage and some
bandwidth for crawling, and everyone could write his/her own website for
searching the data, since it is all exposed as JSON over HTTP.

Just my thoughts; maybe I'm totally wrong, in which case please correct
me.

Best regards
Julian

> You could do something like this a lot more simply with Disco and
> CouchDB, I think, but you'd probably end up writing more of the code.
>
> Chris
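PS: Here are the rough sketches of the pieces I described above, again in
Python over the HTTP API; all database, view and field names ("crawl",
"fetched", "random_key" and so on) are made up for illustration. First
the randomly ordered URL list. CouchDB map functions should be
deterministic, so rather than calling Math.random() inside the view, the
random key is stored in each URL document when it is created, and the
view merely emits it:

    import json
    import urllib.request

    COUCH = "http://localhost:5984/crawl"   # database name is made up

    # Reading this view front to back walks the unfetched URLs in
    # random order; newly inserted documents sort themselves in
    # automatically via their stored random key.
    design = {
        "views": {
            "frontier": {
                "map": """
                    function (doc) {
                      if (doc.type === 'page' && !doc.fetched) {
                        emit(doc.random_key, doc.url);
                      }
                    }
                """
            }
        }
    }

    req = urllib.request.Request(
        COUCH + "/_design/crawler",
        data=json.dumps(design).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)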
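And because the whole interface is nothing but HTTP and JSON, the crawl
client really can be written in any language. A deliberately naive client
sketch (no politeness delays, no creation of new URL documents from
extracted links; it only shows the data flow):

    import json
    import urllib.request

    COUCH = "http://localhost:5984/crawl"   # database name is made up

    def get_json(url):
        with urllib.request.urlopen(url) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def put_json(url, doc):
        req = urllib.request.Request(
            url,
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)

    # Take a small batch from the random-order frontier view ...
    batch = get_json(COUCH + "/_design/crawler/_view/frontier?limit=10")
    for row in batch["rows"]:
        doc = get_json(COUCH + "/" + row["id"])
        try:
            body = urllib.request.urlopen(doc["url"], timeout=10).read()
        except Exception:
            continue          # a real client would log and retry
        # ... and write the fetched page back. The document fetched via
        # GET still carries its _rev, so the PUT is a normal update.
        doc["html"] = body.decode("utf-8", "replace")
        doc["fetched"] = True
        put_json(COUCH + "/" + row["id"], doc)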
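Finally the wordlist: one row per (word, document) pair, so querying the
view with ?key="house" returns exactly the document list from my example
above. The tokenization here is naive and doesn't even strip HTML tags;
it only shows the shape:

    import json
    import urllib.request

    COUCH = "http://localhost:5984/crawl"   # database name is made up

    # Emit each distinct word of a fetched page once, with the document
    # id as value. GET /crawl/_design/search/_view/words?key="house"
    # then lists every document containing "house".
    design = {
        "views": {
            "words": {
                "map": """
                    function (doc) {
                      if (doc.type === 'page' && doc.html) {
                        var words = doc.html.toLowerCase()
                                            .split(/[^a-z0-9]+/);
                        var seen = {};
                        for (var i = 0; i < words.length; i++) {
                          if (words[i] && !seen[words[i]]) {
                            seen[words[i]] = true;
                            emit(words[i], doc._id);
                          }
                        }
                      }
                    }
                """
            }
        }
    }

    req = urllib.request.Request(
        COUCH + "/_design/search",
        data=json.dumps(design).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)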
