Hi,

I am new to Nutch; I managed to run the tutorial and now have a first
idea of how Nutch works.
First of all I want to say 'thank you' to all the contributors to this
great asset made available freely. However, I am unsure how to continue
and therefore seek the community's advice on a number of conceptual
problems.

My situation is the following:

- I have got a handful of sites (seeds), all in the same language
- Altogether there are ~10,000 documents of interest (short MS Word and
PDF files, some HTML) 'somewhere' on these sites/pages
- These documents are to be made available through a web based search
- Reaching the docs of interest often requires human interaction with the
site, such as drilling through a taxonomy, which in turn leads to dynamic
URLs
such as
'../download.php?dnlpid=1373&dnlpvs=1&dnlaid=51&fno=1&ref=showproddtl&refqy=item%3D1373%26grp%3D101'
- I am able to sort out the (large) remainder of the site content
'visually' (by human inspection)
- Documents change rather infrequently; most of all there are 'version'
changes of the 'same' document
- Sites themselves rarely change
- Site (and content) owners are all cooperative
- Site owners have few resources and little technical knowledge to adapt
their sites/backends/feeds/organisations to my needs


My questions are the following:

0) How realistic is it to gather say 95% of the desired resources in a
somewhat robust manner by a crawling approach?
1) Is Nutch the right software basis, taking the above situation into
consideration?
Frankly, I only chose it because I am familiar with Lucene/Solr.
2) Should I spend effort exploring alternative tooling such as
Heritrix?
Is there anybody out there with a similar 'set up' who could share a
thought or give advice?
3) Is it realistic to sort out undesired content by applying 'regexps'
in crawl filters on typical 'home grown', 'shop like' PHP pages?
4) How can I scrape parts of pages? When would I do this? As a
post-filter before indexing? By introducing URL-specific content parsers?
5) How do I best cope with 'dynamic' URLs? Is this feasible at all?
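To make questions 3) and 5) more concrete: as far as I can tell, Nutch's
default conf/regex-urlfilter.txt contains a rule that skips URLs with
query strings, so dynamic URLs like the download.php one above are
dropped unless that rule is relaxed. A minimal sketch of what I imagine
(the exact defaults vary between versions, and www.example.com stands in
for one of my seed hosts):

```
# conf/regex-urlfilter.txt (sketch; first matching pattern wins)

# Default rule that skips URLs containing these characters,
# including '?' -- commented out here to let dynamic URLs through:
# -[?*!@=]

# Accept download.php links with query strings on the seed host:
+^http://www\.example\.com/download\.php\?

# Skip site areas known to be uninteresting ('/shop/cart' is a
# hypothetical path):
-^http://www\.example\.com/shop/cart

# Accept everything else:
+.
```

Would maintaining a handful of such hand-written rules per site be the
intended approach, or does it break down on 'home grown' PHP sites?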


Going forward my target is:

- I would want to manage a few additional metadata fields for each
document. The metadata is likely to be administered in an RDBMS.
- For example I am thinking of accounting for a 'number of recommendations'
per document
- This metadata I would want to exploit in the search using a custom field
with a respective boost
- I would want to follow up on document evolution based on URL, anchor or
title (or even a document content similarity measure)

Hence the following additional questions arise:

6) Assuming I knew my way around the code base, where would I best hook
into it in order to add a single custom field to the index?
7) I understand that with 2.0 the crawldb could be backed by an RDBMS. Is
this true for the segments and the linkdb alike? Will it be one single
schema?
8) Allegedly, using an RDBMS eases 're-crawl' operations in small crawl
scenarios. Which operations exactly, please? Currently, for the sake of
simplicity, I scrap everything before each subsequent crawl.
9) Where would I hook into if say I wanted to account for a change in the
MD5 signature of the same URL?
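Regarding 6), my current guess is that an IndexingFilter plugin is the
usual extension point in the 1.x line. A rough sketch of what I have in
mind, assuming the 1.x plugin API; the class name and the
'RecommendationStore' helper wrapping the RDBMS lookup are hypothetical:

```java
// Sketch: add a 'recommendations' field to each document at indexing
// time (Nutch 1.x IndexingFilter API assumed; not self-contained,
// requires the Nutch jars on the classpath).
public class RecommendationIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // Hypothetical helper: look up the count in the RDBMS by URL.
    int recommendations = RecommendationStore.lookup(url.toString());
    doc.add("recommendations", String.valueOf(recommendations));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}
```

The field boost itself would then presumably be applied on the
Lucene/Solr side. Is that the right hook, or should I be looking at the
scoring filters instead?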
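Regarding 9), as I understand it Nutch already computes a per-page
signature (e.g. org.apache.nutch.crawl.MD5Signature or
TextProfileSignature) and stores it in the crawldb, so detecting a
'version' change of the 'same' document boils down to comparing the
stored digest against a freshly computed one. A self-contained
illustration of that comparison using only the JDK (no Nutch classes):

```java
import java.security.MessageDigest;

public class SignatureDemo {

  // Hex-encoded MD5 digest of some content bytes.
  static String md5Hex(byte[] content) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest(content)) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String stored = md5Hex("document, version 1".getBytes("UTF-8"));
    String fresh  = md5Hex("document, version 2".getBytes("UTF-8"));
    // A differing signature for the same URL means the content changed.
    System.out.println(stored.equals(fresh) ? "unchanged" : "changed");
    // prints "changed"
  }
}
```

What I am missing is where in the crawl cycle I would best react to such
a change (e.g. to carry my metadata over to the new version).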


Thank you all for any line of feedback,
MHM
