Hi, I am new to Nutch. I managed to run the tutorial and now have a first idea of how Nutch works. First of all I want to say 'thank you' to all the contributors to this great asset, made freely available. However, I am unsure how to continue and therefore seek the community's advice on resolving a number of conceptual problems.
My situation is the following:
- I have a handful of sites (seeds), all in the same language.
- Altogether there are ~10,000 documents of interest (short MS Word and PDF files, some HTML) 'somewhere' on these sites/pages.
- These documents are to be made available through a web-based search.
- Reaching the documents of interest often requires human interaction with the site, such as drilling through a taxonomy, which in turn leads to dynamic URLs such as '../download.php?dnlpid=1373&dnlpvs=1&dnlaid=51&fno=1&ref=showproddtl&refqy=item%3D1373%26grp%3D101'.
- The (large) remainder of the site content I am able to sort out 'visually' (by human inspection).
- Documents change rather infrequently; mostly there are 'version' changes of the 'same' document.
- The sites themselves rarely change.
- Site (and content) owners are all cooperative.
- Site owners have few resources and little technical knowledge to adapt their sites/backends/feeds/organisations to my needs.

My questions are the following:
0) How realistic is it to gather, say, 95% of the desired resources in a somewhat robust manner with a crawling approach?
1) Is Nutch the right software basis given the situation above? Frankly, I only chose it because I am familiar with Lucene/Solr.
2) Should I spend effort exploring alternative tooling such as Heritrix? Is there anybody out there with a similar 'set up' who could share a thought or give advice?
3) Is it realistic to sort out undesired content by applying 'regexps' in crawl filters on typical 'home grown', 'shop like' PHP pages?
4) How can I scrape parts of pages? When would I do this? As a post-filter before indexing? By introducing URL-specific content parsers?
5) How do I best cope with 'dynamic' URLs? Is this feasible at all?

Going forward, my targets are:
- I want to manage a few additional metadata items for each document. The metadata will likely be administered in an RDBMS.
- For example, I am thinking of accounting for a 'number of recommendations' per document.
- I want to exploit this metadata in the search, using a custom field with a corresponding boost.
- I want to follow up on document evolution based on URL, anchor or title (or even a document content similarity measure).

Hence the following additional questions arise:
6) Provided I knew the code base, where would I best hook into it in order to add a single custom field to the index?
7) I understand that with 2.0 the crawldb could be backed by an RDBMS. Is this true for the segments and the linkdb alike? Will it be one single schema?
8) Allegedly, using an RDBMS eases 're-crawl' operations in small crawl scenarios. Which operations exactly, please? Currently, for the sake of simplicity, I scrap everything and re-crawl from scratch each time.
9) Where would I best hook in if, say, I wanted to react to a change in the MD5 signature of the same URL?

Thank you all for any line of feedback,
MHM
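To make questions 3) and 5) more concrete, this is the kind of conf/regex-urlfilter.txt I am imagining. The file names and patterns below are made-up illustrations for a generic shop-like PHP site, not taken from a real target; as I understand it, Nutch tries these rules top-down and the first matching +/- pattern decides:

```
# Illustrative sketch only; cart.php/login.php are hypothetical examples
# of shop chrome I would want to drop.
-cart\.php
-login\.php

# Explicitly keep the dynamic download URLs despite their query strings.
+download\.php\?dnlpid=

# Skip all remaining URLs containing query-string characters
# (this rule ships in the default Nutch configuration).
-[?*!@=]

# Accept everything else.
+.
```

Would maintaining such a filter per site be considered workable, or does it tend to break as soon as the shop software changes?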

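As background for question 9): my understanding is that Nutch's default signature implementation hashes the fetched content with MD5, so a new document version yields a new signature value. A standalone sketch of the comparison I mean (plain Java, deliberately without any Nutch classes; the class and method names are my own invention):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureDemo {

    // Hex-encoded MD5 over the raw content bytes -- similar in spirit to
    // the page signature Nutch stores, but a standalone illustration only.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    public static void main(String[] args) {
        String sigV1 = md5Hex("document version 1".getBytes(StandardCharsets.UTF_8));
        String sigV2 = md5Hex("document version 2".getBytes(StandardCharsets.UTF_8));
        // A re-fetch of the same URL counts as 'changed' iff the signature differs.
        System.out.println(sigV1.equals(sigV2) ? "unchanged" : "changed");
    }
}
```

Running this prints "changed". What I am after is the place in Nutch where that "changed" outcome for a known URL becomes visible, so I can trigger my own version-tracking logic there.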
