Re: Crawling localhost Webapps - regex- urfilter query

Lewis John Mcgibbney Wed, 19 Dec 2012 05:21:02 -0800

This sounds most likely to be a missing robots.txt on the web server.
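
One way to test that guess offline is to paste the robots.txt text (or an empty string, if the server has none) into a small script. This is only a sketch using Python's standard urllib.robotparser, not what Nutch runs internally; the sample rules and the agent name "my-crawler" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines, so no
    # network access is needed for this check.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
rules = """User-agent: *
Disallow: /private/
"""

seed = "http://localhost:8080/nutch-test-site/child-1.html"
print(is_allowed(rules, "my-crawler", seed))
print(is_allowed(rules, "my-crawler", "http://localhost:8080/private/page.html"))
```

With these sample rules the seed URL is allowed and the /private/ page is not. How a crawler treats a robots.txt that is missing altogether is up to the crawler, so the fetch logs are still worth checking.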

Lewis


On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski <[email protected]> wrote:

> Hi Tejas,
>
> I found out why the blog was not getting crawled:
> http://rajinimaski.blogspot.in/
> The proxy has a filter that blocks blog sites. I used a different IP and
> am now able to crawl the above blog site successfully.
>
> However, the HTML files that I have put on the local Tomcat web server are
> not getting crawled, and there are no errors either. Attached are the log
> file and sample HTML pages. I will look at the robot rules for this and get
> back.
>
> Thanks very much
> Regards
> Rajani
>
>
>
>
>
> On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <[email protected]> wrote:
>
>> Hi Rajani,
>>
>> *Robot rules? I didn't get this check. Did you mean a setting in
>> nutch-site.xml?*
>> No. See http://en.wikipedia.org/wiki/Robots_exclusion_standard
>>
>> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my
>> end.
>> Without any error or exception, it's hard to tell what the issue is. Set the
>> logger to TRACE or DEBUG and check the logs created for the fetch phase.
>> There must be some message regarding the URL, like
>> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http code=403, url=http://www.abcd.edu/~pqr/homework.html
>> or
>> 2012-12-18 11:24:58,436 TRACE http.Http - fetching http://www.ics.uci.edu/~dan/class/260/notes/
>> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from http://www.ics.uci.edu/~dan/class/260/notes/
>> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>>
>> or something else that can shed light on the issue.
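
If the log is long, a small filter can pull out just the lines that report an HTTP error. The sketch below assumes the log lines look like the two shapes quoted above; real Nutch logs may word things differently, so treat the patterns as a starting point. The sample lines are the ones from this thread.

```python
import re

# Patterns modelled on the two message shapes quoted in this thread
# (assumption: other wordings may also occur in a real log).
FAILED_FETCH = re.compile(r"failed with: Http code=(\d{3})")
TRACE_STATUS = re.compile(r"http\.Http - (\d{3}) ")


def http_errors(lines):
    """Yield (status_code, line) for log lines that report a 4xx/5xx status."""
    for line in lines:
        match = FAILED_FETCH.search(line) or TRACE_STATUS.search(line)
        if match and int(match.group(1)) >= 400:
            yield int(match.group(1)), line.rstrip()


sample = [
    "fetch of http://www.abcd.edu/~pqr/homework.html failed with: "
    "Http code=403, url=http://www.abcd.edu/~pqr/homework.html",
    "2012-12-18 11:24:58,436 TRACE http.Http - fetching "
    "http://www.ics.uci.edu/~dan/class/260/notes/",
    "2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from "
    "http://www.ics.uci.edu/~dan/class/260/notes/",
    "2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required",
]

for code, line in http_errors(sample):
    print(code, line)
```

To use it on a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.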
>>
>> Thanks,
>> Tejas Patil
>>
>> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <[email protected]>
>> wrote:
>>
>> > Hi Tejas,
>> > Thank you for the detailed information. For the checks:
>> >
>> > Check 1: can the URL be fetched via the wget command?
>> >
>> > ubuntu@ubuntu-OptiPlex-390:~$ wget http://localhost:8080/nutch-test-site/child-1.html
>> > --2012-12-18 16:07:34--  http://localhost:8080/nutch-test-site/child-1.html
>> > Resolving localhost (localhost)... 127.0.0.1
>> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
>> > HTTP request sent, awaiting response... 200 OK
>> > Length: 102 [text/html]
>> > Saving to: `child-1.html.1'
>> >
>> > 100%[======================================>] 102         --.-K/s   in 0s
>> >
>> >
>> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>> >
>> > Check 2: what are the robots rules defined for the host? Do they allow
>> > the crawler to crawl that URL? This will address #5.
>> > Robot rules? I didn't get this check. Did you mean a setting in
>> > nutch-site.xml?
>> >
>> > Check 3: After changing the parent page URL from IP-based to localhost and
>> > running a *fresh* crawl, did you see any error or exception in the logs?
>> > Try running a fresh crawl in local mode; it helps in debugging things
>> > quickly.
>> >
>> > Did a fresh crawl. There are no errors, only warnings. The stats are the
>> > same as above.
>> > Configuration: regex-urlfilter.txt has "+." and urls/seed.txt has
>> > http://localhost:8080/nutch-test-site/child-1.html
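
To rule out the URL filter, it may help to see why a lone "+." rule lets this seed through. The sketch below is a simplified model of how I understand the regex URL filter to behave, not Nutch's actual code: rules are tried top to bottom, the first pattern found in the URL decides, "+" accepts, "-" rejects, and a URL that matches nothing is dropped. Nutch uses Java regular expressions, so Python's re module is only an approximation here. The image-suffix rule is hypothetical.

```python
import re


def load_rules(text):
    """Parse filter rules: one '+regex' or '-regex' per line, '#' for comments."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL wins; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


seed = "http://localhost:8080/nutch-test-site/child-1.html"

# The configuration described in this thread: a single accept-everything rule.
accept_all = load_rules("+.")
print(accepts(accept_all, seed))

# Hypothetical stricter file: skip images first, then accept the rest.
stricter = load_rules("# skip images (made-up rule)\n-\\.(gif|jpg|png)$\n+.")
print(accepts(stricter, "http://localhost:8080/nutch-test-site/logo.png"))
```

Under this model "+." accepts the seed URL, which suggests the filter is not what is stopping the fetch.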
>> >
>> > Another important observation: when I set other sites for crawling, such
>> > as http://viterbi.usc.edu/admission/, the crawl is successful and indexed
>> > to Solr. But when I crawl the above HTML page, nothing is fetched. Also,
>> > when I try to crawl the site http://rajinimaski.blogspot.in/ (this has 3
>> > blogs), there is a 403 status: failed to fetch.
>> >
>> >
>> > thanks & Regards
>> > Rajani
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]> wrote:
>> >
>> > > Hi Rajani,
>> > >
>> > > A URL is marked as "db_gone" when Nutch receives one of the following
>> > > HTTP error codes for the request:
>> > > 1. Bad request (error code: 400)
>> > > 2. Not found (error code: 404)
>> > > 3. Access denied (error code: 401)
>> > > 4. Permanently gone (error code: 410)
>> > >
>> > > Apart from this, a URL can also be marked as "db_gone" if:
>> > > 5. it is not crawled due to "Robots denied", or
>> > > 6. some exception is triggered while fetching the content from the
>> > > server (e.g. read timeout, broken socket).
>> > >
>> > > (NOTE: as we are dealing with an HTTP URL here, it made sense to focus
>> > > on HTTP codes only. Nutch has similar handling for the FTP protocol,
>> > > which I preferred not to discuss here.)
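
The six cases above can be written down as a small decision function. This is only a restatement of the list in this message, not Nutch's own code; the function and parameter names are made up for illustration.

```python
# HTTP codes that the list above says lead to "db_gone".
GONE_HTTP_CODES = {400, 401, 404, 410}


def marked_gone(http_status=None, robots_denied=False, fetch_exception=False):
    """Return True if, per the list above, the URL would end up as db_gone."""
    # Cases 5 and 6: robots denial or an exception during the fetch.
    if robots_denied or fetch_exception:
        return True
    # Cases 1-4: one of the listed HTTP error codes.
    return http_status in GONE_HTTP_CODES


print(marked_gone(http_status=404))
print(marked_gone(http_status=200))
print(marked_gone(robots_denied=True))
```

A 200 response with no robots denial and no exception falls outside all six cases, so this sketch does not treat it as gone.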
>> > >
>> > > The reason you could not see the child pages in the crawldb is that
>> > > the parent page has not been fetched successfully.
>> > >
>> > > Quick checks that you can try:
>> > > 1. Can the URL be fetched via the wget command
>> > > <http://linux.die.net/man/1/wget> on the terminal? This will address
>> > > cases 1-4.
>> > > 2. What are the robots rules defined for the host? Do they allow the
>> > > crawler to crawl that URL? This will address #5.
>> > > 3. After changing the parent page URL from IP-based to localhost and
>> > > running a *fresh* crawl, did you see any error or exception in the
>> > > logs? Try running a fresh crawl in local mode; it helps in debugging
>> > > things quickly.
>> > >
>> > > Thanks,
>> > > Tejas Patil
>> > >
>> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
>> > >
>> > > > Can you please tell me what this means: Status: 3 (db_gone)
>> > >
>> >
>>
>
>


-- 
*Lewis*
