This sounds most likely like the robots.txt file does not exist on the webserver.

Lewis
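
One way to check whether a given set of robots rules would block a URL is Python's standard `urllib.robotparser`. This is only a sketch: the rules text and the agent name `my-nutch-crawler` are made up for illustration, and Nutch's own robots handling may differ in details.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules that would block the test site for every crawler.
BLOCKING_RULES = "User-agent: *\nDisallow: /nutch-test-site/\n"
# The seed URL mentioned in the thread.
URL = "http://localhost:8080/nutch-test-site/child-1.html"

if __name__ == "__main__":
    print(is_allowed(BLOCKING_RULES, "my-nutch-crawler", URL))
    print(is_allowed("", "my-nutch-crawler", URL))
```

To test against the live server instead of a string, `RobotFileParser.set_url()` followed by `read()` fetches the host's robots.txt before calling `can_fetch()`.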

On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski <[email protected]> wrote:

> Hi Tejas,
>
> I found out the reason for why the blog was not getting crawled :
> http://rajinimaski.blogspot.in/
> This is because of the proxy that has filter(block) for blog sites. Used
> different IP and
> Now I am able to crawl the above blog site successfully.
>
> However the html files that I have put in local tomcat webserver are not
> getting crawled and there are no errors also. attached is the log file and
> html sample pages.I will look at the robot rules for this and get back.
>
> Thanks very much
> Regards
> Rajani
>
> On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <[email protected]> wrote:
>
>> Hi Rajani,
>>
>> *Robot rules? I didn't get this check. Did you mean any setting in
>> nutch-site xml ?*
>> No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard
>>
>> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my end.
>> Without any error or exception its hard to tell issue. Set the logger to
>> TRACE or DEBUG and see the logs created for the fetch phase.
>> There must be some message regarding the url like
>> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http code=403,
>> url=http://www.abcd.edu/~pqr/homework.html
>> or
>> 2012-12-18 11:24:58,436 TRACE http.Http - fetching http://www.ics.uci.edu/~dan/class/260/notes/
>> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from http://www.ics.uci.edu/~dan/class/260/notes/
>> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>>
>> or something else that can shed the light on the issue.
>>
>> Thanks,
>> Tejas Patil
>>
>> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <[email protected]> wrote:
>>
>> > Hi Tejas,
>> > Thank you for detailed information.
>> > For the checks,
>> >
>> > Check 1 - can the url be fetched via wget command :
>> >
>> > ubuntu@ubuntu-OptiPlex-390:~$ wget http://localhost:8080/nutch-test-site/child-1.html
>> > --2012-12-18 16:07:34-- http://localhost:8080/nutch-test-site/child-1.html
>> > Resolving localhost (localhost)... 127.0.0.1
>> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
>> > HTTP request sent, awaiting response... 200 OK
>> > Length: 102 [text/html]
>> > Saving to: `child-1.html.1'
>> >
>> > 100%[======================================>] 102 --.-K/s in 0s
>> >
>> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>> >
>> > Check 2 : what are the robots rules defined for the host ? Do they allow
>> > the crawler to crawl that url ? this will address #5.
>> > Robot rules? I didn't get this check. Did you mean any setting in
>> > nutch-site xml ?
>> >
>> > 3. After changing the parent page url from IP based to localhost and
>> > running a *fresh* crawl, did you see any error or exception in the logs ?
>> > try running fresh crawl in local mode, its helps in debugging things
>> > quickly.
>> >
>> > Did a fresh crawl. There are no errors only warnings. The stats is same as
>> > above.
>> > configuration : regexurl-filter.txt has "+." and urls/seed.txt has
>> > http://localhost:8080/nutch-test-site/child-1.html
>> >
>> > Also important observation is when I set other sites for crawling like
>> > http://viterbi.usc.edu/admission/ etc.,. crawl is successful and indexed to
>> > solr. But when I crawl the above html page nothing is fetched. Also when I
>> > am trying to crawl the site: http://rajinimaski.blogspot.in/ (this has 3
>> > blogs) there is 403 status - failed to fetch.
>> >
>> > thanks & Regards
>> > Rajani
>> >
>> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]> wrote:
>> >
>> > > Hi Rajani,
>> > >
>> > > A url is marked as "db_gone" when nutch receives below HTTP error codes for
>> > > the request:
>> > > 1. Bad request (error code: 400)
>> > > 2. Not found (error code: 404)
>> > > 3. Access denied (error code: 401)
>> > > 4. Permanently gone (error code: 410)
>> > >
>> > > Apart from this, a url can also be marked as "db_gone" if:
>> > > 5. its not getting crawled due to "Robots denied" or
>> > > 6. some exception is triggered while fetching the content from the server
>> > > (eg. Read time out, Broken socket etc.)
>> > >
>> > > (NOTE: as we are dealing with a HTTP url here, it made sense to focus on
>> > > HTTP codes only. For FTP protocol, nutch has similar stuff. I preferred to
>> > > avoid discussing that.)
>> > >
>> > > The reason why you could not see the child pages in the crawldb: because
>> > > the parent page has not been fetched successfully.
>> > >
>> > > Quick checks that you can try:
>> > > 1. can the url be fetched via wget command
>> > > <http://linux.die.net/man/1/wget> on the terminal ? this will address
>> > > cases 1-4
>> > > 2. what are the robots rules defined for the host ? Do they allow the
>> > > crawler to crawl that url ? this will address #5.
>> > > 3. After changing the parent page url from IP based to localhost and
>> > > running a *fresh* crawl, did you see any error or exception in the logs ?
>> > > try running fresh crawl in local mode, its helps in debugging things
>> > > quickly.
>> > >
>> > > Thanks,
>> > > Tejas Patil
>> > >
>> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
>> > >
>> > > > Can you please tell me what does this mean : Status: 3 (db_gone)

--
*Lewis*
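
The db_gone conditions listed in the oldest quoted message (items 1 to 6) can be summarized in a few lines of Python. This is only an illustration of that list, not Nutch's actual code; the function name and its parameters are hypothetical.

```python
# HTTP codes that, per the list in the quoted message, lead to db_gone:
# 400 Bad request, 401 Access denied, 404 Not found, 410 Permanently gone.
DB_GONE_HTTP_CODES = {400, 401, 404, 410}


def would_be_db_gone(http_code=None, robots_denied=False, fetch_exception=False):
    """Return True if the described conditions would mark a url as db_gone.

    Hypothetical helper: it mirrors items 1-6 of the quoted list only.
    """
    # Items 5 and 6: robots denial or an exception during the fetch.
    if robots_denied or fetch_exception:
        return True
    # Items 1-4: one of the listed HTTP error codes.
    return http_code in DB_GONE_HTTP_CODES


if __name__ == "__main__":
    print(would_be_db_gone(http_code=404))
    print(would_be_db_gone(http_code=200))
    print(would_be_db_gone(robots_denied=True))
```

In the case from the thread, wget returns 200 for the page, which rules out items 1 to 4 and leaves the robots rules and fetch exceptions as the candidates to check.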

