Re: Parallel indexing, maybe tokenizing, maybe rate limiting

Andrzej Bialecki Mon, 14 Jun 2010 08:27:44 -0700

On 2010-06-10 22:56, Spencer Portee wrote:
> Hi,
> 
> I have a problem I'm trying to solve.  It boils down to taking web pages
> (not the associated images and other things), and storing the result.  
> 
> - We aren't using HDFS, but other NoSQL systems available for
> distributed storage.  If the HTML were saved with the URL as its key,
> retrieval would be easy.


Until we implement an ORM layer (see the discussions on NutchBase and
the Gora project), this will be very difficult. In the meantime, a
slightly easier strategy would be to implement the Hadoop FileSystem
API on top of your KV store.
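
To illustrate the adapter idea only: the real Hadoop API is the Java
abstract class org.apache.hadoop.fs.FileSystem, and a real
implementation would subclass it. The Python sketch below just shows
how paths can map to keys and file bodies to values. DictStore,
KVFileSystem, and their method names are hypothetical stand-ins, not
part of Hadoop or Nutch.

```python
class DictStore:
    """Stand-in for a distributed KV store (hypothetical; swap in a real client)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]  # raises KeyError if the key is absent

    def remove(self, key):
        existed = key in self._data
        self._data.pop(key, None)
        return existed

    def keys(self):
        return list(self._data)


class KVFileSystem:
    """File-system-style facade: paths become keys, file bodies become values."""

    def __init__(self, store):
        self.store = store

    @staticmethod
    def _key(path):
        # Normalize so "a/b", "/a/b" and "/a/b/" all map to the same key.
        return "/" + path.strip("/")

    def create(self, path, data):
        self.store.put(self._key(path), data)

    def open(self, path):
        try:
            return self.store.get(self._key(path))
        except KeyError:
            raise FileNotFoundError(path)

    def delete(self, path):
        return self.store.remove(self._key(path))

    def list_status(self, directory):
        # Emulate a directory listing with a key-prefix scan.
        prefix = self._key(directory).rstrip("/") + "/"
        return sorted(k for k in self.store.keys() if k.startswith(prefix))
```

The prefix scan in list_status is the part most likely to be expensive
on a real KV store, so it is worth checking whether your store supports
range or prefix queries before going down this road.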

> - Parallel indexing is desired, but we need to be nice to the sites
> we're indexing.  We can't have any one site being indexed have more
> than 2 or so connections opened to it.  Imagine 10k pages and, in
> parallel, 10k requests going out to their cluster.  Oof.

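The Nutch Fetcher already takes care of this: it limits concurrent
connections per host (see fetcher.threads.per.host and
fetcher.server.delay in nutch-default.xml).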
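
For readers unfamiliar with the general technique, per-host politeness
can be sketched as one bounded semaphore per host. This is not the
Fetcher's actual code; HostLimiter and the default limit of 2 (taken
from the question) are assumptions made for illustration.

```python
import threading
from contextlib import contextmanager
from urllib.parse import urlsplit


class HostLimiter:
    """Caps the number of simultaneous connections to any single host."""

    def __init__(self, max_per_host=2):
        self._max = max_per_host
        self._lock = threading.Lock()
        self._slots = {}  # host -> BoundedSemaphore

    def _slot_for(self, url):
        host = urlsplit(url).hostname or ""
        # Guard the dict so two threads never create two semaphores
        # for the same host.
        with self._lock:
            if host not in self._slots:
                self._slots[host] = threading.BoundedSemaphore(self._max)
            return self._slots[host]

    @contextmanager
    def connection(self, url):
        slot = self._slot_for(url)
        slot.acquire()  # blocks while the host is at its limit
        try:
            yield
        finally:
            slot.release()
```

Each worker thread would wrap its request in
"with limiter.connection(url):", so 10k queued pages for one site still
produce at most two open connections to it, while other hosts proceed
independently.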

> - We're doing some custom processing and may want to query the system
> for specific URLs, for the pure HTML in its original format, or go so
> far as doing some inline processing (e.g. give me all H1 tags, or word
> count) and getting back that result.

This can be accomplished today with the bin/nutch read* tools and the
corresponding API. Take a look at how these tools are implemented;
they can dump all content (raw, parsed, outlinks, etc.) or select
individual records by URL.
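
Once you have the raw HTML for a URL, the inline processing you mention
is straightforward. A minimal sketch using only the Python standard
library follows; H1AndWordCounter and summarize are hypothetical names,
and the word count is a naive whitespace split over visible text, not
Nutch's own parsing.

```python
from html.parser import HTMLParser


class H1AndWordCounter(HTMLParser):
    """Collects the text of every <h1> and counts words in visible text."""

    def __init__(self):
        super().__init__()
        self.h1_texts = []
        self.word_count = 0
        self._in_h1 = False
        self._skipping = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skipping = True
        elif tag == "h1":
            self._in_h1 = True
            self.h1_texts.append("")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skipping = False
        elif tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._skipping:
            return  # script and style bodies are not page text
        self.word_count += len(data.split())
        if self._in_h1:
            # Nested inline tags (<b>, <em>, ...) still contribute text.
            self.h1_texts[-1] += data


def summarize(html):
    """Return (list of H1 texts, word count) for one HTML document."""
    parser = H1AndWordCounter()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.h1_texts], parser.word_count
```

Calling summarize(raw_html) on the content retrieved for a URL returns
the H1 texts and the word count in one pass.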

> 
> In an ideal world, a squid cluster would have 2 or 3 outbound
> connections per site, Nutch would store the HTML documents, and I
> could query them back URL by URL.  It's a very specific request, so I
> imagine Nutch and other technologies sit somewhere in the middle.

See above - you can do it now. Specifically, 'bin/nutch readseg ...'
will give you the raw HTML by URL.

> 
> Any place I can look towards for more info?

Nutch Wiki, tutorial, and perhaps this slideset:

http://www.slideshare.net/abial/nutch-webscale-search-engine-toolkit

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
