Re: Two parallel Nutch crawls with two conf folders
On Tue, 9 Mar 2010 14:36:33 +0100 MilleBii mille...@gmail.com wrote:

Never tried... Also you may want to check the $NUTCH_HOME variable, which should be different for each instance; otherwise it will only use one of the two conf dirs. [...]

I had meant to reply to the original poster, but forgot. We have indeed run multiple instances of Nutch in separate directories, without any problems.

I presume that you are using the crawl.sh script, or a derivative of it. If so, as pointed out above, a likely cause of what you are seeing is that the NUTCH_HOME variable in the script is set to the same directory for both instances, so that the configuration from that one directory is always picked up.

Regards, Gora
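For illustration, a minimal sketch of such a two-instance setup, assuming a crawl.sh-style wrapper; the install paths and crawl options are hypothetical. Since bin/nutch picks up the conf/ directory relative to its own location, invoking each copy from its own tree keeps the configurations separate:

    # Instance 1: invoked from its own tree, so its own conf/ is used
    NUTCH_HOME=/opt/nutch-one
    $NUTCH_HOME/bin/nutch crawl $NUTCH_HOME/urls -dir $NUTCH_HOME/crawl -depth 3 &

    # Instance 2: a separate copy, with its own conf/ directory
    NUTCH_HOME=/opt/nutch-two
    $NUTCH_HOME/bin/nutch crawl $NUTCH_HOME/urls -dir $NUTCH_HOME/crawl -depth 3 &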
Re: Deleting stale URLs from Nutch/Solr
On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki a...@getopt.org wrote:

[...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in the Nutch crawldb to prevent their re-discovery (through stale links pointing to these URL-s from other pages). If you really want to remove them from the CrawlDb you can filter them out (using CrawlDbMerger with just one input db, and setting your URLFilters appropriately). [...]

Thank you for your help. Your suggestions look promising, but I think that I did not make myself adequately clear. Once we have completed a site crawl with Nutch, I would ideally like to be able to find stale links without doing a complete recrawl, i.e., only by restarting the crawl from where it last left off. Is that possible?

I tried a simple test on a local webserver with five pages in a three-level hierarchy. The crawl completes, and discovers all five URLs as expected. Now, I remove a tertiary page. Ideally, I would like to be able to run a recrawl and have Nutch discover the now-missing URL. However, when I try that, it finds no new links, and exits. ./bin/nutch readdb crawl/crawldb -stats shows me:

    CrawlDb statistics start: crawl/crawldb
    Statistics for CrawlDb: crawl/crawldb
    TOTAL urls:             5
    retry 0:                5
    min score:              0.333
    avg score:              0.4664
    max score:              1.0
    status 2 (db_fetched):  5
    CrawlDb statistics: done

Regards, Gora
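For reference, the CrawlDbMerger route suggested above is exposed on the command line as bin/nutch mergedb. A sketch, assuming the URL filters (e.g. conf/regex-urlfilter.txt) have first been set up to reject the stale URLs, and with hypothetical paths:

    # Merge the crawldb with itself, applying the configured URLFilters;
    # entries whose URLs are rejected by the filters are dropped
    bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

    # Swap the filtered db into place
    mv crawl/crawldb crawl/crawldb.old
    mv crawl/crawldb_filtered crawl/crawldb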
Re: Deleting stale URLs from Nutch/Solr
On Tue, 27 Oct 2009 07:29:10 +0100 Andrzej Bialecki a...@getopt.org wrote:

[...] I assume you mean that the generate step produces no new URL-s to fetch? That's expected, because they become eligible for re-fetching only after Nutch considers them expired, i.e. after the fetchTime + fetchInterval, and the default fetchInterval is 30 days.

Yes, it was indeed stopping at the generate step, and your explanation makes sense.

You can pretend that the time moved on using the -adddays parameter. [...]

Thanks. This worked exactly as you said. I have tested this, and the removed page indeed shows up with status db_gone, and I can now script a solution for my problem with stale URLs, along the lines that you have suggested.

Thank you very much for this quick and thorough response. As I imagine that this is a common requirement, I will write up a brief blog entry on this by the weekend, along with a solution.

Regards, Gora
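The recrawl described here might look like the following sketch, with the clock moved past the default 30-day fetchInterval via -adddays; the segment handling is simplified and the paths are hypothetical:

    # Generate a fetch list as if 31 days had passed, making the
    # already-fetched URLs eligible for re-fetching
    bin/nutch generate crawl/crawldb crawl/segments -adddays 31

    # Fetch the newest segment and update the crawldb; the removed page
    # should then appear with status db_gone
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment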
Deleting stale URLs from Nutch/Solr
Hi,

We are using Nutch to crawl an internal site, and index the content to Solr. The issue is that the site is run through a CMS, and occasionally pages are deleted, so that the corresponding URLs become invalid.

Is there any way that Nutch can discover stale URLs during recrawls, or is the only solution a completely fresh crawl? Also, is it possible to have Nutch automatically remove such stale content from Solr?

I am stumped by this problem, and would appreciate any pointers, or even thoughts on this.

Regards, Gora
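The replies above address the crawl side. For the Solr side, one possible approach (a sketch, not something suggested in this thread) is to delete stale documents by id through Solr's standard XML update interface, assuming the Solr document id is the page URL:

    # Delete a stale document from Solr by id and commit; the URL shown
    # is a hypothetical example
    curl 'http://localhost:8983/solr/update?commit=true' \
         -H 'Content-Type: text/xml' \
         --data-binary '<delete><id>http://intranet.example.com/deleted-page.html</id></delete>'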
Re: indexing just certain content
On Fri, 9 Oct 2009 18:00:41 +0200 MilleBii mille...@gmail.com wrote:

Don't think it will work, because at the indexing-filter stage all the HTML tags are gone from the text. I think you need to modify the HTML parser to filter out the tags you want to get rid of. In some use cases I have, I would like to perform 'intelligent indexing', i.e., use the tag information to extract specific fields to be indexed along with the main text. A reverse case of yours. To date I have not found a way to do it, so if you find a solution I'm with you. [...]

This is something that we would also be interested in. Actually, we even have a working solution to extract content from between start/stop tags, written by our colleagues from a partner company. There are a couple of things that we would like to fix with this solution:

(a) It directly modifies HtmlParser.java, which is obviously unmaintainable.

(b) It is a solution for specific tags, rather than picking them up from configuration parameters.

(c) We have not yet traced the complete execution path for Nutch, i.e., when the parser is called, when the filters are called, etc. Is there a document anywhere about this? We were thinking of a filter, but from what you say above, that is the wrong stage.

(d) Ideally, whatever solution we come up with would be contributed back to Nutch, which also helps us from a maintenance standpoint. Is there a defined process for getting external plugins accepted into Nutch?

We are willing to put some time into this, starting the coming week. Where can we start a brainstorming Wiki for this? Is the Nutch Wiki the right place?

Regards, Gora