Re: Crawling localhost Webapps - regex- urfilter query

Tejas Patil Tue, 18 Dec 2012 13:18:40 -0800

Hi Rajani,

*Robot rules? I didn't get this check. Did you mean any setting in
nutch-site.xml?*
No, I meant the robots.txt rules that the host publishes for crawlers. See
http://en.wikipedia.org/wiki/Robots_exclusion_standard
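
To make the robots check concrete, here is a minimal sketch using Python's
standard urllib.robotparser. The robots.txt content and the agent name
"MyNutchSpider" are made-up placeholders, not taken from your setup;
substitute whatever the host's /robots.txt actually returns and the
http.agent.name value from your nutch-site.xml. Nutch's own robots parsing
may differ in details, so treat this only as a rough cross-check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; replace with the real one from the host.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    agent = "MyNutchSpider"  # placeholder for your http.agent.name
    for url in (
        "http://localhost:8080/nutch-test-site/child-1.html",
        "http://localhost:8080/private/secret.html",
    ):
        print(url, "allowed" if is_allowed(SAMPLE_ROBOTS, agent, url) else "denied")
```

If the seed url comes back as denied for your agent name, that would explain
a page that wget can fetch but the crawler cannot.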


I was able to crawl http://rajinimaski.blogspot.in/ successfully at my end.
Without an error or exception, it's hard to tell what the issue is. Set the
logger to TRACE or DEBUG and check the logs created for the fetch phase.
There should be some message about the url, such as
fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http code=403,
url=http://www.abcd.edu/~pqr/homework.html
or
2012-12-18 11:24:58,436 TRACE http.Http - fetching
http://www.ics.uci.edu/~dan/class/260/notes/
2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from
http://www.ics.uci.edu/~dan/class/260/notes/
2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required

or something else that can shed light on the issue.
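
If the log is large, a small script can pull out the failures. This is only
a sketch: it assumes failure lines look like the first sample above (on a
single line in the real log) and that in local mode the log is written to
logs/hadoop.log; adjust both if your setup differs.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches lines such as:
#   fetch of <url> failed with: Http code=403, url=<url>
FAILED = re.compile(r"fetch of (?P<url>\S+) failed with: (?P<reason>.*)")
HTTP_CODE = re.compile(r"code=(\d{3})")


def find_fetch_failures(lines: Iterable[str]) -> List[Tuple[str, Optional[int]]]:
    """Return (url, http_code) pairs for every 'failed with' line.

    http_code is None when the failure reason carries no HTTP code,
    for example when an exception was raised instead.
    """
    failures = []
    for line in lines:
        match = FAILED.search(line)
        if not match:
            continue
        code = HTTP_CODE.search(match.group("reason"))
        failures.append((match.group("url"), int(code.group(1)) if code else None))
    return failures


# Sample lines modelled on the messages quoted in this mail.
SAMPLE_LOG = [
    "2012-12-18 11:24:58,436 TRACE http.Http - fetching "
    "http://www.ics.uci.edu/~dan/class/260/notes/",
    "fetch of http://www.abcd.edu/~pqr/homework.html failed with: "
    "Http code=403, url=http://www.abcd.edu/~pqr/homework.html",
]

if __name__ == "__main__":
    for url, code in find_fetch_failures(SAMPLE_LOG):
        print(code, url)
```

To run it against a real log, pass the open file instead of SAMPLE_LOG, e.g.
find_fetch_failures(open("logs/hadoop.log")).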

Thanks,
Tejas Patil

On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <[email protected]> wrote:

> Hi Tejas,
> Thank you for detailed information. For the checks,
>
> Check 1  - can the url be fetched via wget command :
>
> ubuntu@ubuntu-OptiPlex-390:~$ wget
> http://localhost:8080/nutch-test-site/child-1.html
> --2012-12-18 16:07:34--
> http://localhost:8080/nutch-test-site/child-1.html
> Resolving localhost (localhost)... 127.0.0.1
> Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 102 [text/html]
> Saving to: `child-1.html.1'
>
> 100%[======================================>] 102         --.-K/s   in 0s
>
>
> 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>
> Check 2 : What robots rules are defined for the host? Do they allow
> the crawler to crawl that url? This will address #5.
> Robot rules? I didn't get this check. Did you mean any setting in
> nutch-site.xml?
>
> 3. After changing the parent page url from IP-based to localhost and
> running a *fresh* crawl, did you see any error or exception in the logs?
> Try running a fresh crawl in local mode; it helps in debugging things
> quickly.
>
> Did a fresh crawl. There are no errors, only warnings. The stats are the
> same as above.
> Configuration: regex-urlfilter.txt has "+." and urls/seed.txt has
> http://localhost:8080/nutch-test-site/child-1.html
>
> Another important observation: when I set other sites for crawling, such
> as http://viterbi.usc.edu/admission/, the crawl is successful and indexed
> to Solr. But when I crawl the above html page, nothing is fetched. Also,
> when I try to crawl the site http://rajinimaski.blogspot.in/ (this has 3
> blogs), there is a 403 status - failed to fetch.
>
>
> thanks & Regards
> Rajani
>
>
>
>
>
> On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]> wrote:
>
> > Hi Rajani,
> >
> > A url is marked as "db_gone" when Nutch receives one of the following
> > HTTP error codes for the request:
> > 1. Bad request (error code: 400)
> > 2. Not found (error code: 404)
> > 3. Access denied (error code: 401)
> > 4. Permanently gone (error code: 410)
> >
> > Apart from this, a url can also be marked as "db_gone" if:
> > 5. it is not crawled due to "Robots denied", or
> > 6. some exception is triggered while fetching the content from the server
> > (e.g. read timeout, broken socket, etc.)
> >
> > (NOTE: as we are dealing with an HTTP url here, it made sense to focus on
> > HTTP codes only. Nutch has similar handling for the FTP protocol, which I
> > preferred not to discuss here.)
> >
> > You could not see the child pages in the crawldb because the parent page
> > was not fetched successfully (child links are only discovered by parsing
> > a fetched page).
> >
> > Quick checks that you can try:
> > 1. Can the url be fetched via the wget command
> > <http://linux.die.net/man/1/wget> on the terminal? This will address
> > cases 1-4.
> > 2. What robots rules are defined for the host? Do they allow the
> > crawler to crawl that url? This will address #5.
> > 3. After changing the parent page url from IP-based to localhost and
> > running a *fresh* crawl, did you see any error or exception in the logs?
> > Try running a fresh crawl in local mode; it helps in debugging things
> > quickly.
> >
> > Thanks,
> > Tejas Patil
> >
> > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
> >
> > >  Can you please tell me what this means: Status: 3 (db_gone)
> >
>
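
The db_gone conditions from the quoted list above can be restated as a small
lookup. This is a sketch that only mirrors the six cases named in this
thread; it is not Nutch's actual code, and the function and parameter names
are made up for illustration.

```python
from typing import Optional

# HTTP codes the thread lists as leading to "db_gone" (cases 1-4).
GONE_HTTP_CODES = {
    400: "Bad request",
    404: "Not found",
    401: "Access denied",
    410: "Permanently gone",
}


def is_marked_gone(
    http_code: Optional[int] = None,
    robots_denied: bool = False,
    fetch_exception: bool = False,
) -> bool:
    """Return True if, per the list in this thread, the url ends up as db_gone.

    Only the cases named in the thread are covered; any other code or
    condition returns False here, which says nothing about how Nutch
    treats it.
    """
    if robots_denied or fetch_exception:  # cases 5 and 6
        return True
    return http_code in GONE_HTTP_CODES  # cases 1-4


if __name__ == "__main__":
    print(is_marked_gone(http_code=404))
    print(is_marked_gone(http_code=200))
    print(is_marked_gone(robots_denied=True))
```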
