Re: Crawling localhost Webapps - regex- urfilter query

Tejas Patil Tue, 18 Dec 2012 00:30:18 -0800

Hi Rajani,

A URL is marked as "db_gone" when Nutch receives one of the following HTTP
error codes for the request:
1. Bad request (error code: 400)
2. Not found (error code: 404)
3. Access denied (error code: 401)
4. Permanently gone (error code: 410)


Apart from this, a URL can also be marked as "db_gone" if:
5. it is not crawled because of "Robots denied", or
6. an exception is thrown while fetching the content from the server
(e.g. read timeout, broken socket).
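
To make the six cases above concrete, here is a minimal Python sketch of the
decision they describe. It is only an illustration of the list in this email,
not Nutch's actual source code; the names GONE_HTTP_CODES and is_db_gone are
hypothetical.

```python
# Illustrative only: mirrors the six cases listed above,
# not the real Nutch implementation.
GONE_HTTP_CODES = {
    400: "Bad request",
    401: "Access denied",
    404: "Not found",
    410: "Permanently gone",
}


def is_db_gone(http_code=None, robots_denied=False, fetch_exception=False):
    """Return True if a URL would end up as db_gone per the cases above."""
    # Cases 5 and 6: robots denial or a fetch-time exception.
    if robots_denied or fetch_exception:
        return True
    # Cases 1-4: one of the listed HTTP error codes.
    return http_code in GONE_HTTP_CODES
```

For example, is_db_gone(404) is True, while is_db_gone(200) is False.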

(NOTE: as we are dealing with an HTTP URL here, it made sense to focus on
HTTP codes only. Nutch has similar handling for the FTP protocol, which I
have left out of this discussion.)

You could not see the child pages in the crawldb because the parent page
was not fetched successfully, so its outlinks were never extracted.

Quick checks that you can try:
1. Can the URL be fetched with the wget command
<http://linux.die.net/man/1/wget> on the terminal? This will address
cases 1-4.
2. What robots rules are defined for the host? Do they allow the
crawler to crawl that URL? This will address #5.
3. After changing the parent page URL from IP-based to localhost and
running a *fresh* crawl, did you see any error or exception in the logs?
Try running a fresh crawl in local mode; it helps in debugging things
quickly.
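
If you prefer to script checks 1 and 2 rather than run them by hand, the
following Python sketch uses only the standard library. It is a rough
stand-in for wget and for reading robots.txt manually, not something Nutch
provides; the helper names (http_status, robots_allow), the agent name
"mycrawler", and the example URLs are assumptions made for illustration.

```python
import urllib.error
import urllib.request
import urllib.robotparser


def http_status(url, timeout=10):
    """Check 1: return the HTTP status code the server sends for url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code


def robots_allow(robots_txt, agent, url):
    """Check 2: return True if the given robots.txt text lets agent fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules and URL, for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
blocked = not robots_allow(rules, "mycrawler",
                           "http://localhost:8080/private/page.html")
```

A status of 400, 401, 404 or 410 from http_status points to cases 1-4, and a
False result from robots_allow points to case 5. Timeouts and socket errors
(case 6) are deliberately not caught here, so they surface as exceptions.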

Thanks,
Tejas Patil

On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:

>  Can you please tell me what does this mean : Status: 3 (db_gone)
