Hi Spencer

> - We aren't using HDFS, but other NoSQL systems available for
> distributed storage.  If the HTML was saved using a key of the url,
> retrieval becomes easy
>

This will be a feature of Nutch 2.0, which will use Gora
(http://github.com/enis/gora) as a front end to all sorts of NoSQL (and SQL)
backends.


> - Parallel indexing is desired, but we need to be nice to the sites
> we're indexing.  We can't have any 1 site being index have more  than 2
> or so connections opened to it.  Imagine 10k pages and in parallel, 10k
> requests went out to their cluster.  Oof.
>

That's already present in Nutch: you can control how frequently the Fetcher
hits each host.
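
As a rough sketch, assuming the 1.x property names as I remember them from
conf/nutch-default.xml (please check the copy shipped with your version, as
names and defaults may differ), the overrides in conf/nutch-site.xml would
look something like this. The values are only illustrative:

```xml
<configuration>
  <!-- Assumed property name: seconds to wait between successive
       requests to the same server. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
  <!-- Assumed property name: maximum number of fetcher threads allowed
       to access one host at the same time. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>
</configuration>
```

With a cap like this, a site with 10k pages is still fetched through at most
two concurrent connections, however many fetcher threads run overall.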


> - We're doing some custom processing and may want to query the system
> for specific urls for, the pure html in the original format, or go so
> far as doing some inline processing (e.x. give me all H1 tags, or word
> count) and getting back that result.
>
> In an ideal world, a squid cluster would have 2 or 3 outbound connection
> per site, nutch would store the html documents and I could query them
> back url by url.  It's a very specific request, so I imagine nutch and
> other technologies sit somewhere in the middle.


Once everything (original content, metadata, text) is stored in the webtable,
you could process it in pretty much any way you want, including with
MapReduce.
One option could also be to write your own indexers and do your custom
processing there.
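
To make the "all H1 tags, or word count" idea concrete, here is a minimal
sketch in plain Python. It is not Nutch or Gora code: the `store` dict is a
hypothetical stand-in for any backend that returns the raw HTML for a URL,
and `H1AndWordCounter` and `analyse` are names invented for this example.

```python
from html.parser import HTMLParser


class H1AndWordCounter(HTMLParser):
    """Collects the text of every <h1> element and counts visible words."""

    def __init__(self):
        super().__init__()
        self.h1_texts = []     # text of each completed <h1>
        self.word_count = 0    # whitespace-separated tokens in text nodes
        self._h1_parts = None  # buffer while inside an <h1>, else None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "h1":
            self._h1_parts = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "h1" and self._h1_parts is not None:
            # Join the pieces and normalise internal whitespace.
            self.h1_texts.append(" ".join("".join(self._h1_parts).split()))
            self._h1_parts = None

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        self.word_count += len(data.split())
        if self._h1_parts is not None:
            self._h1_parts.append(data)


def analyse(html):
    """Return the H1 texts and the word count for one HTML document."""
    parser = H1AndWordCounter()
    parser.feed(html)
    parser.close()
    return {"h1": parser.h1_texts, "words": parser.word_count}


# Hypothetical URL-keyed store; a real setup would query the backend instead.
store = {
    "http://example.com/": (
        "<html><body><h1>Hello <b>world</b></h1>"
        "<p>Three more words</p></body></html>"
    ),
}

result = analyse(store["http://example.com/"])
```

The same per-document function could be called from a MapReduce job or from
a custom indexer, since it only needs the HTML string for one URL at a time.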


>
> Any place I can look towards for more info?
>

http://github.com/dogacan/nutchbase - which is the basis for Nutch 2.0
http://github.com/enis/gora

Both are still in beta stage but fit well with what you described.

Another option could be to stick to the Nutch 1.1 branch and write a custom
indexer which would send the document representation directly to your NoSQL
backend (via Gora or not). However this sounds like a perfect demonstration
of the flexibility that Nutch 2.0 will provide.

Here are two talks that Andrzej and I gave at Berlin Buzzwords last week; both
are related to using Nutch as an input to custom analysis:

Nutch as a web mining platform - the present and the future
http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/bialecki_bbuzz2010.pdf

Behemoth - a Hadoop based platform for large scale document processing
http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

HTH

Julien

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
