On 2010-06-10 22:56, Spencer Portee wrote:
> Hi,
>
> I have a problem I'm trying to solve. It boils down to taking web pages
> (not the associated images and other things), and storing the result.
>
> - We aren't using HDFS, but other NoSQL systems available for
> distributed storage. If the HTML was saved using a key of the url,
> retrieval becomes easy

Until we implement an ORM layer (see the discussions on NutchBase and
the Gora project) this will be very difficult. In the meantime, a
slightly easier to implement strategy would be to wrap a Hadoop
FileSystem API on top of your KV store.

> - Parallel indexing is desired, but we need to be nice to the sites
> we're indexing. We can't have any 1 site being index have more than 2
> or so connections opened to it. Imagine 10k pages and in parallel, 10k
> requests went out to their cluster. Oof.

Nutch Fetcher already takes care of this.

> - We're doing some custom processing and may want to query the system
> for specific urls for, the pure html in the original format, or go so
> far as doing some inline processing (e.x. give me all H1 tags, or word
> count) and getting back that result.

This can be accomplished today by using the bin/nutch read* tools, and
the corresponding API - take a look at how these tools are implemented
today, they allow dumping all content (raw, parsed, outlinks, etc), or
select only individual records by URL.

>
> In an ideal world, a squid cluster would have 2 or 3 outbound connection
> per site, nutch would store the html documents and I could query them
> back url by url. It's a very specific request, so I imagine nutch and
> other technologies sit somewhere in the middle.

See above - you can do it now. Specifically, 'bin/nutch readseg ...'
will give you the raw HTML by URL.

>
> Any place I can look towards for more info?

Nutch Wiki, tutorial, and perhaps this slideset:

http://www.slideshare.net/abial/nutch-webscale-search-engine-toolkit

-- 
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
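P.S. For the "inline processing" part (all H1 tags, or a word count),
here is a rough sketch of what could be run over the raw HTML once you
have it back for a URL, e.g. from the readseg output mentioned above.
It is plain Python using only the standard library, not part of Nutch;
the names H1Collector and summarize are made up for this illustration,
and the word count is a naive count of \w+ tokens in the visible text.

```python
import re
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects the text of every <h1> and all visible text of a page."""

    def __init__(self):
        super().__init__()
        self.headings = []     # text of each completed <h1>
        self.text_parts = []   # all text outside <script>/<style>
        self._h1_depth = 0     # > 0 while inside an <h1>
        self._h1_buf = []      # text pieces of the <h1> being read
        self._skip = 0         # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._h1_depth += 1
            if self._h1_depth == 1:
                self._h1_buf = []
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self._h1_depth > 0:
            self._h1_depth -= 1
            if self._h1_depth == 0:
                # Collapse runs of whitespace inside the heading.
                self.headings.append(" ".join("".join(self._h1_buf).split()))
        elif tag in ("script", "style") and self._skip > 0:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return
        self.text_parts.append(data)
        if self._h1_depth:
            self._h1_buf.append(data)


def summarize(html):
    """Return the H1 headings and a naive word count for one HTML page."""
    parser = H1Collector()
    parser.feed(html)
    parser.close()
    words = re.findall(r"\w+", " ".join(parser.text_parts))
    return {"h1": parser.headings, "word_count": len(words)}


if __name__ == "__main__":
    page = "<html><body><h1>Hello <b>World</b></h1><p>one two</p></body></html>"
    print(summarize(page))
```

In a real deployment this kind of logic would more likely live in a
Nutch parse plugin or a job over the segment data; the sketch only
shows the per-document step.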

