Re: Crawling localhost Webapps - regex- urfilter query

Rajani Maski Tue, 18 Dec 2012 03:36:47 -0800

Hi Tejas,
Thank you for the detailed information. Here are the results of the checks:

Check 1: can the URL be fetched via the wget command?

ubuntu@ubuntu-OptiPlex-390:~$ wget http://localhost:8080/nutch-test-site/child-1.html
--2012-12-18 16:07:34--  http://localhost:8080/nutch-test-site/child-1.html
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102 [text/html]
Saving to: `child-1.html.1'

100%[======================================>] 102         --.-K/s   in 0s

2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
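
The same check can be scripted. The following is only a minimal sketch, assuming Python 3 and its standard urllib module; the function names and the GONE_CODES set are made up for illustration, and the set simply copies the four status codes listed in the quoted reply below, not anything read from Nutch itself.

```python
import urllib.error
import urllib.request

# Status codes that the quoted reply lists as leading to db_gone.
# (Copied from the email for illustration; not taken from Nutch's source.)
GONE_CODES = {400, 401, 404, 410}


def http_status(url, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for error responses; the code is still useful.
        return err.code


def would_be_gone(status):
    """True if status is one of the codes listed for db_gone in the reply."""
    return status in GONE_CODES


def check(url):
    """Fetch url once and describe the outcome in one line."""
    try:
        status = http_status(url)
    except OSError as exc:
        # Connection refused, timeouts and similar failures (case 6 in the reply).
        return f"{url}: fetch failed ({exc})"
    verdict = "listed for db_gone" if would_be_gone(status) else "not listed for db_gone"
    return f"{url}: HTTP {status}, {verdict}"
```

Calling check("http://localhost:8080/nutch-test-site/child-1.html") on the machine that hosts the test site makes one request and summarizes the result; a 200 here only rules out cases 1-4.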

Check 2: what are the robots rules defined for the host? Do they allow the
crawler to crawl that URL? This will address #5.

Robots rules? I didn't understand this check. Did you mean a setting in
nutch-site.xml?
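
For context, "robots rules" normally means the robots.txt file served by the host being crawled. A rough way to test such rules against one URL is sketched below, assuming Python 3 and its standard urllib.robotparser module. The agent name "my-nutch-crawler" and the EXAMPLE_ROBOTS content are hypothetical, and Nutch's own robots handling may differ in details.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, agent, url):
    """Apply already-downloaded robots.txt lines to a single URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    page = "http://localhost:8080/nutch-test-site/child-1.html"
    print(allowed_by_robots(EXAMPLE_ROBOTS, "my-nutch-crawler", page))
```

To test the real rules instead of the example, the parser can load them from the host with set_url("http://localhost:8080/robots.txt") followed by read().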

Check 3: After changing the parent page URL from IP-based to localhost and
running a *fresh* crawl, did you see any error or exception in the logs?
Try running a fresh crawl in local mode; it helps in debugging things
quickly.

Did a fresh crawl. There are no errors, only warnings. The stats are the
same as above.
Configuration: regex-urlfilter.txt has "+." and urls/seed.txt has
http://localhost:8080/nutch-test-site/child-1.html
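
To illustrate why a lone "+." rule should not be what blocks the seed URL, here is a small sketch of how such a filter file is commonly understood to work: each rule is a "+" or "-" followed by a regular expression, the first rule that matches decides, and a URL that matches no rule is dropped. This is an approximation written in Python, not Nutch's implementation; parse_rules and accepts are made-up names.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; no match means the URL is dropped."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules("+.")
    print(accepts(rules, "http://localhost:8080/nutch-test-site/child-1.html"))
```

With only "+." in the file, every non-empty URL matches the single rule and is accepted, so the filter is unlikely to be the reason the page is not fetched.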

Another important observation: when I set other sites for crawling, such as
http://viterbi.usc.edu/admission/, the crawl is successful and the pages are
indexed to Solr. But when I crawl the above HTML page, nothing is fetched.
Also, when I try to crawl http://rajinimaski.blogspot.in/ (this has 3
blogs), I get a 403 status: failed to fetch.

Thanks & regards,
Rajani

On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]> wrote:

> Hi Rajani,
>
> A URL is marked as "db_gone" when Nutch receives one of the following HTTP
> error codes for the request:
> 1. Bad request (error code: 400)
> 2. Not found (error code: 404)
> 3. Access denied (error code: 401)
> 4. Permanently gone (error code: 410)
>
> Apart from this, a URL can also be marked as "db_gone" if:
> 5. it is not crawled because of "Robots denied", or
> 6. some exception is triggered while fetching the content from the server
> (e.g. read timeout, broken socket, etc.)
>
> (NOTE: as we are dealing with an HTTP URL here, it made sense to focus on
> HTTP codes only. For the FTP protocol, Nutch has similar handling. I preferred
> to avoid discussing that.)
>
> The reason you could not see the child pages in the crawldb is that
> the parent page has not been fetched successfully.
>
> Quick checks that you can try:
> 1. Can the URL be fetched via the wget command
> <http://linux.die.net/man/1/wget> on the terminal? This will address
> cases 1-4.
> 2. What are the robots rules defined for the host? Do they allow the
> crawler to crawl that URL? This will address #5.
> 3. After changing the parent page URL from IP-based to localhost and
> running a *fresh* crawl, did you see any error or exception in the logs?
> Try running a fresh crawl in local mode; it helps in debugging things
> quickly.
>
> Thanks,
> Tejas Patil
>
> On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
>
> >  Can you please tell me what this means: Status: 3 (db_gone)
>
