Hi, I have a problem I'm trying to solve. It boils down to fetching web pages (just the HTML, not the associated images and other assets) and storing the result.
- We aren't using HDFS, but one of the other NoSQL systems available for distributed storage. If the HTML is saved with the URL as the key, retrieval becomes easy.
- Parallel indexing is desired, but we need to be nice to the sites we're indexing: no single site being indexed should have more than 2 or so connections open to it at once. Imagine 10k pages and, in parallel, 10k requests going out to their cluster. Oof.
- We're doing some custom processing and may want to query the system for specific URLs, getting back either the pure HTML in its original form or the result of some inline processing (e.g. give me all H1 tags, or a word count).

In an ideal world, a Squid cluster would keep 2 or 3 outbound connections per site, Nutch would store the HTML documents, and I could query them back URL by URL. It's a very specific request, so I imagine Nutch and other technologies sit somewhere in the middle. Any place I can look for more info? (I've pasted a couple of rough sketches of what I mean in the P.S. below.)

Cheers,
-spencer p
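P.S. To make the connection-limit and key-by-URL points concrete, here is a rough Python sketch (stdlib only) of the fetch side: a thread-pool crawler that caps concurrent connections per host at 2 and writes raw HTML into a plain dict keyed by URL. The dict, the constant, and the function names are just placeholders for whatever distributed store and fetcher we actually end up with, not a proposal for the real implementation.

import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

MAX_PER_HOST = 2          # be nice: at most 2 open connections per site
store = {}                # stand-in for the distributed KV store, keyed by URL
store_lock = threading.Lock()

_host_limits = {}
_limits_lock = threading.Lock()

def _host_limit(host):
    # One semaphore per host; the lock only protects creation of the entry.
    with _limits_lock:
        if host not in _host_limits:
            _host_limits[host] = threading.Semaphore(MAX_PER_HOST)
        return _host_limits[host]

def fetch_and_store(url):
    host = urlsplit(url).netloc
    with _host_limit(host):   # blocks while 2 fetches to this host are already in flight
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    with store_lock:
        store[url] = html     # key = URL, value = raw HTML

def crawl(urls, workers=50):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces iteration so any fetch errors surface here
        list(pool.map(fetch_and_store, urls))

So crawl() can fan out to 50 workers overall while any one host still sees at most 2 concurrent requests, e.g. crawl(["https://example.com/a", "https://example.com/b", "https://example.org/"]).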
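And the read side, roughly: given the stored HTML for a URL, either hand it back as-is or run a small inline transform (all H1 tags, or a word count). Again just a sketch with made-up mode names; html.parser is stdlib, and in a real deployment this processing might run next to the store rather than on the client.

from html.parser import HTMLParser

class H1Collector(HTMLParser):
    # Collects the text content of every <h1> tag in a document.
    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.h1s = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.h1s.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1s[-1] += data

def query(store, url, mode="raw"):
    html = store[url]             # retrieval is just a key lookup on the URL
    if mode == "raw":
        return html
    if mode == "h1":
        parser = H1Collector()
        parser.feed(html)
        return [h.strip() for h in parser.h1s]
    if mode == "wordcount":
        return len(html.split())  # crude: counts markup too; strip tags first if that matters
    raise ValueError(mode)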

