Hi Spencer

> - We aren't using HDFS, but other NoSQL systems available for
> distributed storage. If the HTML was saved using a key of the url,
> retrieval becomes easy
>

This will be a feature of Nutch 2.0, which will use GORA (http://github.com/enis/gora) as a front end to all sorts of NoSQL (and SQL) backends.

> - Parallel indexing is desired, but we need to be nice to the sites
> we're indexing. We can't have any 1 site being index have more than 2
> or so connections opened to it. Imagine 10k pages and in parallel, 10k
> requests went out to their cluster. Oof.
>

That's already present in Nutch: you can control how frequently the hosts are hit by the Fetcher.

> - We're doing some custom processing and may want to query the system
> for specific urls for, the pure html in the original format, or go so
> far as doing some inline processing (e.x. give me all H1 tags, or word
> count) and getting back that result.
>
> In an ideal world, a squid cluster would have 2 or 3 outbound connection
> per site, nutch would store the html documents and I could query them
> back url by url. It's a very specific request, so I imagine nutch and
> other technologies sit somewhere in the middle.

Once everything (original content, metadata, text) is stored in the webtable, you could process it in pretty much any way you want, including with MapReduce. One option could also be to write your own indexers and do your custom processing there.

>
> Any place I can look towards for more info?
>

http://github.com/dogacan/nutchbase - which is the basis for Nutch 2.0
http://github.com/enis/gora

Both are still in beta stage but fit well with what you described. Another option could be to stick to the Nutch 1.1 branch and write a custom indexer which would send the document representation directly to your NoSQL backend (via Gora or not). However, this sounds like a perfect demonstration of the flexibility that Nutch 2.0 will provide.
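
To make the custom processing idea a bit more concrete, here is a minimal sketch in Python of the kind of per-URL analysis you mentioned (all H1 tags, word count). It is not Nutch or Gora code: the `store` dict is only a hypothetical stand-in for whatever backend holds the raw HTML keyed by URL, and `H1AndWordCounter` and `analyse` are names made up for the illustration. It relies only on the standard library's html.parser, and the word count is a rough whitespace split of the text nodes.

```python
from html.parser import HTMLParser


class H1AndWordCounter(HTMLParser):
    """Collects the text of every <h1> and counts words outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.h1_texts = []      # text of each completed <h1>
        self.word_count = 0     # whitespace-separated words in text nodes
        self._h1_parts = None   # buffer while inside an <h1>, otherwise None
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "h1":
            self._h1_parts = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "h1" and self._h1_parts is not None:
            # Join the pieces and normalise internal whitespace.
            self.h1_texts.append(" ".join("".join(self._h1_parts).split()))
            self._h1_parts = None

    def handle_data(self, data):
        if self._skip_depth:
            return
        self.word_count += len(data.split())
        if self._h1_parts is not None:
            self._h1_parts.append(data)


def analyse(store, url):
    """Return the H1 texts and a word count for the raw HTML stored under `url`."""
    parser = H1AndWordCounter()
    parser.feed(store[url])
    parser.close()
    return {"url": url, "h1": parser.h1_texts, "words": parser.word_count}


# Hypothetical stand-in for the backend: raw HTML keyed by URL.
store = {
    "http://example.com/": (
        "<html><head><style>h1 {color: red}</style></head>"
        "<body><h1>Hello <b>Nutch</b></h1><p>Stored by URL.</p></body></html>"
    ),
}

result = analyse(store, "http://example.com/")
print(result)
```

In a real setup the lookup `store[url]` would be replaced by a read from your backend, and the same function could run inside a MapReduce job or a custom indexer instead of being called one URL at a time.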

Here are 2 talks that Andrzej and I gave at Berlin Buzzwords last week; both are related to using Nutch as an input to custom analysis:

Nutch as a web mining platform - the present and the future
http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/bialecki_bbuzz2010.pdf

Behemoth - a Hadoop based platform for large scale document processing
http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

HTH

Julien

-- 
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

