Hi Tejas,

Please find my replies embedded below. Thank you for your reply and your time.
> *"status 1 (db_unfetched): 1"* means that url [1] is NOT crawled.
> (FYI: it is not interpreted as "db_unfetched - status is 1". The number 1
> here indicates that there is 1 url in the crawldb with status as
> db_unfetched.)
>
> You said that there are no exceptions in the log file. Which log file did
> you see ? If you are running in the distributed mode, then you must see the hadoop
> logs (on jobtracker) for the nutch jobs.
>

It is a basic local setup. I downloaded the binary version of Apache Nutch 1.5.1 and followed the setup steps mentioned in the wiki. The log file path is ../Downloads/apache-nutch-1.5.1/logs/hadoop.log

Note: I was using the system IP address yesterday, but today there is an exception for the same url:

2012-12-18 12:42:04,558 INFO fetcher.Fetcher - fetch of http://43.44.111.123:8080/nutch-test-site/home.html failed with: java.net.SocketTimeoutException: Read timed out

So I changed it to localhost, and now the stats are:

ubuntu@ubuntu-OptiPlex-390:~/Downloads/apache-nutch-1.5.1$ bin/nutch readdb crawlnewtest/crawldb -stats
CrawlDb statistics start: crawlnewtest/crawldb
Statistics for CrawlDb: crawlnewtest/crawldb
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 3 (db_gone): 1
CrawlDb statistics: done

> Also, can you send the entry of the url [1] from the crawldb ?
> The command is:
> *bin/nutch readdb <path to the crawldb> -url <url>*
>

Command:

/Downloads/apache-nutch-1.5.1$ bin/nutch readdb crawlnewtest/crawldb -url http://localhost:8080/nutch-test-site/home.html
URL: http://localhost:8080/nutch-test-site/home.html
Version: 7
Status: 3 (db_gone)
Fetch time: Fri Feb 01 12:26:54 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: _pst_: gone(11), lastModified=0: http://localhost:8080/nutch-test-site/home.html

Can you please tell me what "Status: 3 (db_gone)" means? (Or could you point me to a reference link where I can find out what such responses mean?)

> If you are not able to get any output for above command, then get the dump
> of whole crawldb using this command:
> *bin/nutch readdb <path to the crawldb> -dump <output directory>*
>

/Downloads/apache-nutch-1.5.1$ bin/nutch readdb crawlnewtest/crawldb -dump /home/ubuntu/Downloads/apache-nutch-1.5.1/test
CrawlDb dump: starting
CrawlDb db: crawlnewtest/crawldb
CrawlDb dump: done

Output is:

http://localhost:8080/nutch-test-site/home.html Version: 7
Status: 3 (db_gone)
Fetch time: Fri Feb 01 12:26:54 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: _pst_: gone(11), lastModified=0: http://localhost:8080/nutch-test-site/home.html

In your previous email, under Point B (if the db status is fetched), you wrote:

*B. The main url gets crawled successfully but the rest 2 child pages are not getting crawled.*
If the url [1] is db_fetched, then use the same command "bin/nutch CrawlDbReader" to see the status of the rest 2 child pages.

Can you please correct me if the command I am using is wrong? It lists only the one main url.

Command:

/Downloads/apache-nutch-1.5.1$ bin/nutch readdb crawlnewtest/crawldb -stats

Thank you very much.
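
As an aside, here is a minimal sketch of how the per-status counts could be read out of the readdb -stats output, keeping in mind that the trailing number on a status line is the count of urls and not the status code. The function name parse_status_counts and the regular expression are my own assumptions for illustration; they are not part of Nutch.

```python
import re

# Matches lines such as "status 3 (db_gone): 1" in `bin/nutch readdb <crawldb> -stats` output.
# Groups: numeric status code, status name, number of urls with that status.
STATUS_LINE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s+(\d+)")


def parse_status_counts(stats_output):
    """Return {status_name: url_count} for every status line found in the text."""
    counts = {}
    for line in stats_output.splitlines():
        match = STATUS_LINE.search(line)
        if match:
            _code, name, count = match.groups()
            counts[name] = int(count)
    return counts


if __name__ == "__main__":
    # Sample text copied from the -stats run on crawlnewtest/crawldb.
    sample = (
        "CrawlDb statistics start: crawlnewtest/crawldb\n"
        "Statistics for CrawlDb: crawlnewtest/crawldb\n"
        "TOTAL urls: 1\n"
        "retry 0: 1\n"
        "status 3 (db_gone): 1\n"
        "CrawlDb statistics: done\n"
    )
    print(parse_status_counts(sample))  # prints {'db_gone': 1}
```

For the stats in this thread it reports a single url with status db_gone and nothing under db_fetched or db_unfetched.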
Regards,
Rajani

> thanks,
> Tejas Patil
>
> On Mon, Dec 17, 2012 at 8:51 PM, Rajani Maski <[email protected]> wrote:
>
> > status 1 (db_unfetched): 1
>

