Hi Tejas,

Thank you for the detailed information. Here are the results of the checks.

Check 1: can the url be fetched via the wget command?

ubuntu@ubuntu-OptiPlex-390:~$ wget http://localhost:8080/nutch-test-site/child-1.html
--2012-12-18 16:07:34-- http://localhost:8080/nutch-test-site/child-1.html
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102 [text/html]
Saving to: `child-1.html.1'

100%[======================================>] 102 --.-K/s in 0s

2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]

Check 2: what are the robots rules defined for the host? Do they allow the
crawler to crawl that url? This will address #5.

Robots rules? I did not understand this check. Did you mean a setting in
nutch-site.xml?

Check 3: After changing the parent page url from IP based to localhost and
running a *fresh* crawl, did you see any error or exception in the logs? Try
running a fresh crawl in local mode, it helps in debugging things quickly.

I did a fresh crawl. There are no errors, only warnings. The stats are the
same as above.

Configuration: regex-urlfilter.txt has "+." and urls/seed.txt has
http://localhost:8080/nutch-test-site/child-1.html

Another important observation: when I set other sites for crawling, such as
http://viterbi.usc.edu/admission/, the crawl is successful and the pages are
indexed to Solr. But when I crawl the above html page, nothing is fetched.

Also, when I try to crawl the site http://rajinimaski.blogspot.in/ (this has
3 blogs), I get a 403 status: failed to fetch.

Thanks & Regards
Rajani

On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]> wrote:

> Hi Rajani,
>
> A url is marked as "db_gone" when nutch receives one of the HTTP error
> codes below for the request:
> 1. Bad request (error code: 400)
> 2. Not found (error code: 404)
> 3. Access denied (error code: 401)
> 4. Permanently gone (error code: 410)
>
> Apart from this, a url can also be marked as "db_gone" if:
> 5. it's not getting crawled due to "Robots denied" or
> 6. some exception is triggered while fetching the content from the server
> (eg. Read time out, Broken socket etc.)
>
> (NOTE: as we are dealing with a HTTP url here, it made sense to focus on
> HTTP codes only. For FTP protocol, nutch has similar stuff. I preferred to
> avoid discussing that.)
>
> You could not see the child pages in the crawldb because the parent page
> has not been fetched successfully.
>
> Quick checks that you can try:
> 1. can the url be fetched via wget command
> <http://linux.die.net/man/1/wget> on the terminal ? this will address
> cases 1-4
> 2. what are the robots rules defined for the host ? Do they allow the
> crawler to crawl that url ? this will address #5.
> 3. After changing the parent page url from IP based to localhost and
> running a *fresh* crawl, did you see any error or exception in the logs ?
> try running fresh crawl in local mode, it helps in debugging things
> quickly.
>
> Thanks,
> Tejas Patil
>
> On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
>
> > Can you please tell me what does this mean : Status: 3 (db_gone)
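
On the robots rules question in check 2: these rules normally come from a robots.txt file served at the root of the host (for this setup that would be http://localhost:8080/robots.txt), not from a Nutch config file. Below is a minimal sketch, in Python rather than Nutch itself, of how such rules are evaluated for a url. The rule text and the agent name "my-nutch-crawler" are made-up examples and are not taken from this setup; the real agent name would presumably be whatever http.agent.name is set to.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules, for illustration only (not the robots.txt of any real host).
EXAMPLE_RULES = """\
User-agent: *
Disallow: /nutch-test-site/private/
"""

URL = "http://localhost:8080/nutch-test-site/child-1.html"

if __name__ == "__main__":
    # The url is outside the disallowed directory, so this prints True.
    print(is_allowed(EXAMPLE_RULES, "my-nutch-crawler", URL))
    # A blanket "Disallow: /" blocks every path, so this prints False.
    print(is_allowed("User-agent: *\nDisallow: /\n", "my-nutch-crawler", URL))
```

If the host serves no robots.txt at all, there are no rules to deny the url, and case 5 is unlikely to be the cause.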

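
The HTTP codes listed in the quoted mail can be written down as a small lookup. This is only a restatement of that list in Python, not Nutch's actual code; the names DB_GONE_CODES and classify, and the label "not in the list", are made up for illustration. Codes that the mail does not mention (for example the 403 seen for the blogspot site) are deliberately left unclassified.

```python
# HTTP codes that the quoted mail says lead to "db_gone" (cases 1-4).
DB_GONE_CODES = {
    400: "Bad request",
    401: "Access denied",
    404: "Not found",
    410: "Permanently gone",
}


def classify(http_code: int) -> str:
    """Map an HTTP status code to "db_gone" if it is one of the listed codes."""
    if http_code in DB_GONE_CODES:
        return "db_gone"
    # Anything else is outside the list in the mail; no claim is made about it.
    return "not in the list"


if __name__ == "__main__":
    for code in (200, 403, 404):
        print(code, classify(code))
```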
