Additionally, wget is great for fetching pages on the fly, but it does not necessarily mean that your Nutch server will and/or should be able to fetch the page.
I would always recommend using the parserchecker [0] tool for on the fly fetching and parser checking. It can be run from the command line very easily.

hth
Lewis

[0] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java

On Wed, Dec 19, 2012 at 1:20 PM, Lewis John Mcgibbney <[email protected]> wrote:

> This sounds most like non-existence of robots.txt on the webserver.
>
> Lewis
>
> On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski <[email protected]> wrote:
>
>> Hi Tejas,
>>
>> I found out the reason for why the blog was not getting crawled :
>> http://rajinimaski.blogspot.in/
>> This is because of the proxy that has filter(block) for blog sites. Used
>> different IP and
>> Now I am able to crawl the above blog site successfully.
>>
>> However the html files that I have put in local tomcat webserver are not
>> getting crawled and there are no errors also. attached is the log file and
>> html sample pages.I will look at the robot rules for this and get back.
>>
>> Thanks very much
>> Regards
>> Rajani
>>
>> On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <[email protected]> wrote:
>>
>>> Hi Rajani,
>>>
>>> *Robot rules? I didn't get this check. Did you mean any setting in
>>> nutch-site xml ?*
>>> No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard
>>>
>>> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my end.
>>> Without any error or exception its hard to tell issue. Set the logger to
>>> TRACE or DEBUG and see the logs created for the fetch phase.
>>> There must be some message regarding the url like
>>> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http code=403,
>>> url=http://www.abcd.edu/~pqr/homework.html
>>> or
>>> 2012-12-18 11:24:58,436 TRACE http.Http - fetching
>>> http://www.ics.uci.edu/~dan/class/260/notes/
>>> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from
>>> http://www.ics.uci.edu/~dan/class/260/notes/
>>> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>>>
>>> or something else that can shed the light on the issue.
>>>
>>> Thanks,
>>> Tejas Patil
>>>
>>> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <[email protected]> wrote:
>>>
>>> > Hi Tejas,
>>> > Thank you for detailed information. For the checks,
>>> >
>>> > Check 1 - can the url be fetched via wget command :
>>> >
>>> > ubuntu@ubuntu-OptiPlex-390:~$ wget http://localhost:8080/nutch-test-site/child-1.html
>>> > --2012-12-18 16:07:34-- http://localhost:8080/nutch-test-site/child-1.html
>>> > Resolving localhost (localhost)... 127.0.0.1
>>> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
>>> > HTTP request sent, awaiting response... 200 OK
>>> > Length: 102 [text/html]
>>> > Saving to: `child-1.html.1'
>>> >
>>> > 100%[======================================>] 102 --.-K/s in 0s
>>> >
>>> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>>> >
>>> > Check 2 : what are the robots rules defined for the host ? Do they allow
>>> > the crawler to crawl that url ? this will address #5.
>>> > Robot rules? I didn't get this check. Did you mean any setting in
>>> > nutch-site xml ?
>>> >
>>> > 3. After changing the parent page url from IP based to localhost and
>>> > running a *fresh* crawl, did you see any error or exception in the logs ?
>>> > try running fresh crawl in local mode, its helps in debugging things
>>> > quickly.
>>> >
>>> > Did a fresh crawl. There are no errors only warnings. The stats is same as
>>> > above.
>>> > configuration : regexurl-filter.txt has "+." and urls/seed.txt has
>>> > http://localhost:8080/nutch-test-site/child-1.html
>>> >
>>> > Also important observation is when I set other sites for crawling like
>>> > http://viterbi.usc.edu/admission/ etc.,. crawl is successful and indexed
>>> > to solr. But when I crawl the above html page nothing is fetched. Also
>>> > when I am trying to crawl the site: http://rajinimaski.blogspot.in/ (this
>>> > has 3 blogs) there is 403 status - failed to fetch.
>>> >
>>> > thanks & Regards
>>> > Rajani
>>> >
>>> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]> wrote:
>>> >
>>> > > Hi Rajani,
>>> > >
>>> > > A url is marked as "db_gone" when nutch receives below HTTP error codes
>>> > > for the request:
>>> > > 1. Bad request (error code: 400)
>>> > > 2. Not found (error code: 404)
>>> > > 3. Access denied (error code: 401)
>>> > > 4. Permanently gone (error code: 410)
>>> > >
>>> > > Apart from this, a url can also be marked as "db_gone" if:
>>> > > 5. its not getting crawled due to "Robots denied" or
>>> > > 6. some exception is triggered while fetching the content from the server
>>> > > (eg. Read time out, Broken socket etc.)
>>> > >
>>> > > (NOTE: as we are dealing with a HTTP url here, it made sense to focus on
>>> > > HTTP codes only. For FTP protocol, nutch has similar stuff. I preferred
>>> > > to avoid discussing that.)
>>> > >
>>> > > The reason why you could not see the child pages in the crawldb: because
>>> > > the parent page has not been fetched successfully.
>>> > >
>>> > > Quick checks that you can try:
>>> > > 1. can the url be fetched via wget command
>>> > > <http://linux.die.net/man/1/wget> on the terminal ? this will address
>>> > > cases 1-4
>>> > > 2. what are the robots rules defined for the host ? Do they allow the
>>> > > crawler to crawl that url ? this will address #5.
>>> > > 3. After changing the parent page url from IP based to localhost and
>>> > > running a *fresh* crawl, did you see any error or exception in the logs ?
>>> > > try running fresh crawl in local mode, its helps in debugging things
>>> > > quickly.
>>> > >
>>> > > Thanks,
>>> > > Tejas Patil
>>> > >
>>> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
>>> > >
>>> > > > Can you please tell me what does this mean : Status: 3 (db_gone)
>
> --
> *Lewis*

--
*Lewis*
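
For anyone who wants to try the robots check from the thread (quick check 2, "Robots denied") outside of Nutch, here is a minimal sketch using Python's standard urllib.robotparser. It is not how Nutch evaluates robots rules internally and its verdict may differ from Nutch's. The agent name "my-nutch-agent" and the SAMPLE_ROBOTS text are made up for illustration; substitute the agent name you configured and the robots.txt your server actually returns.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network access is needed for this check.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, only for illustration: everything is allowed
# except paths under /private/.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    agent = "my-nutch-agent"  # assumption: replace with your configured agent name
    for url in (
        "http://localhost:8080/nutch-test-site/child-1.html",
        "http://localhost:8080/private/secret.html",
    ):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS, agent, url) else "denied"
        print(url, "->", verdict)
```

To check a live server instead, the same parser offers set_url() and read() to download robots.txt before calling can_fetch().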

