Hi Rajani,

A URL is marked as "db_gone" when Nutch receives one of the following HTTP error codes for the request:

1. Bad request (error code: 400)
2. Not found (error code: 404)
3. Access denied (error code: 401)
4. Permanently gone (error code: 410)
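
To make the list above concrete, here is a minimal Python sketch of that mapping. It is only an illustration of the four codes listed in this mail, not Nutch's actual source; the names GONE_HTTP_CODES and would_be_marked_gone are hypothetical.

```python
# Illustrative only: mirrors the four HTTP codes listed in this mail.
# This is NOT Nutch's implementation; all names here are hypothetical.
GONE_HTTP_CODES = {
    400: "Bad request",
    401: "Access denied",
    404: "Not found",
    410: "Permanently gone",
}


def would_be_marked_gone(http_code: int) -> bool:
    """Return True if the code is one of the four listed above."""
    return http_code in GONE_HTTP_CODES


if __name__ == "__main__":
    # Print a verdict for a handful of sample response codes.
    for code in (200, 301, 400, 401, 404, 410, 500):
        verdict = "db_gone" if would_be_marked_gone(code) else "not in the list above"
        print(code, verdict)
```

Codes outside the table (for example 200 or 500) are simply reported as not being in the list; how Nutch treats them is outside the scope of this sketch.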

Apart from this, a URL can also be marked as "db_gone" if:

5. it is not crawled because of "Robots denied", or
6. an exception is thrown while fetching the content from the server (e.g. read timeout, broken socket).

(NOTE: Since we are dealing with an HTTP URL here, it made sense to focus on HTTP codes only. Nutch has similar handling for the FTP protocol, which I have left out.)

You could not see the child pages in the crawldb because the parent page was not fetched successfully.

Quick checks that you can try:

1. Can the URL be fetched with the wget command <http://linux.die.net/man/1/wget> on the terminal? This addresses cases 1-4.
2. What robots rules are defined for the host? Do they allow the crawler to crawl that URL? This addresses case 5.
3. After changing the parent page URL from IP-based to localhost and running a *fresh* crawl, did you see any error or exception in the logs? Try running a fresh crawl in local mode; it helps in debugging things quickly.

Thanks,
Tejas Patil

On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
> Can you please tell me what does this mean : Status: 3 (db_gone)

