Hi, I have a problem I'm trying to solve. It boils down to fetching web pages (just the HTML, not the associated images and other assets) and storing the result.
- We aren't using HDFS, but one of the other NoSQL systems available for distributed storage. If the HTML is saved with the URL as the key, retrieval becomes easy.
- Parallel indexing is desired, but we need to be nice to the sites we're indexing: no single site being indexed should have more than 2 or so connections open to it at once. Imagine 10k pages and, in parallel, 10k requests going out to their cluster. Oof.
- We're doing some custom processing and may want to query the system for specific URLs, getting back either the pure HTML in its original form or the result of some inline processing (e.g. give me all H1 tags, or a word count).

In an ideal world, a Squid cluster would keep 2 or 3 outbound connections per site, Nutch would store the HTML documents, and I could query them back URL by URL. It's a very specific request, so I imagine Nutch and other technologies sit somewhere in the middle. Any place I can look for more info? (I've pasted a couple of rough sketches of what I mean in the P.S. below.)

Cheers,
-spencer p
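P.S. To make the connection-limit and key-by-URL points concrete, here is a rough Python sketch (stdlib only) of the fetch side: a thread-pool crawler that caps concurrent connections per host at 2 and writes raw HTML into a plain dict keyed by URL. The dict, the constant, and the function names are just placeholders for whatever distributed store and fetcher we actually end up with, not a proposal for the real implementation.

import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

MAX_PER_HOST = 2          # be nice: at most 2 open connections per site
store = {}                # stand-in for the distributed KV store, keyed by URL
store_lock = threading.Lock()

_host_limits = {}
_limits_lock = threading.Lock()

def _host_limit(host):
    # One semaphore per host; the lock only protects creation of the entry.
    with _limits_lock:
        if host not in _host_limits:
            _host_limits[host] = threading.Semaphore(MAX_PER_HOST)
        return _host_limits[host]

def fetch_and_store(url):
    host = urlsplit(url).netloc
    with _host_limit(host):   # blocks while 2 fetches to this host are already in flight
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    with store_lock:
        store[url] = html     # key = URL, value = raw HTML

def crawl(urls, workers=50):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces iteration so any fetch errors surface here
        list(pool.map(fetch_and_store, urls))

So crawl() can fan out to 50 workers overall while any one host still sees at most 2 concurrent requests, e.g. crawl(["https://example.com/a", "https://example.com/b", "https://example.org/"]).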
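And the read side, roughly: given the stored HTML for a URL, either hand it back as-is or run a small inline transform (all H1 tags, or a word count). Again just a sketch with made-up mode names; html.parser is stdlib, and in a real deployment this processing might run next to the store rather than on the client.

from html.parser import HTMLParser

class H1Collector(HTMLParser):
    # Collects the text content of every <h1> tag in a document.
    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.h1s = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.h1s.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1s[-1] += data

def query(store, url, mode="raw"):
    html = store[url]             # retrieval is just a key lookup on the URL
    if mode == "raw":
        return html
    if mode == "h1":
        parser = H1Collector()
        parser.feed(html)
        return [h.strip() for h in parser.h1s]
    if mode == "wordcount":
        return len(html.split())  # crude: counts markup too; strip tags first if that matters
    raise ValueError(mode)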

