Re: Crawling localhost Webapps - regex- urfilter query

Rajani Maski Tue, 18 Dec 2012 21:27:18 -0800

Hi Tejas,

 I found out why the blog was not getting crawled:
http://rajinimaski.blogspot.in/
The proxy has a filter that blocks blog sites. I used a different IP and
am now able to crawl the above blog site successfully.


However, the html files that I have put on the local tomcat webserver are
still not getting crawled, and there are no errors either. Attached are the
log file and sample html pages. I will look at the robot rules for this and
get back.
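
For the robots check, a minimal Python sketch (an illustration only, not part
of Nutch) that tests whether a set of robots.txt rules would allow a URL. The
sample rules and the agent name "my-nutch-agent" are hypothetical; the real
agent name is whatever http.agent.name is set to in nutch-site.xml.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for the host's real robots.txt.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

# Seed url from this thread; the agent name is an assumption.
seed_url = "http://localhost:8080/nutch-test-site/child-1.html"
print(is_allowed(sample_rules, "my-nutch-agent", seed_url))
```

To test against the live host instead, set_url() with the host's /robots.txt
address followed by read() loads the real rules before calling can_fetch().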

Thanks very much
Regards
Rajani





On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <[email protected]>wrote:

> Hi Rajani,
>
> *Robot rules? I didn't get this check. Did you mean any setting in
> nutch-site.xml?*
> No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard
>
> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my
> end.
> Without any error or exception it's hard to tell what the issue is. Set the
> logger to TRACE or DEBUG and look at the logs created for the fetch phase.
> There must be some message regarding the url, like
> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http code=403, url=http://www.abcd.edu/~pqr/homework.html
> or
> 2012-12-18 11:24:58,436 TRACE http.Http - fetching http://www.ics.uci.edu/~dan/class/260/notes/
> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from http://www.ics.uci.edu/~dan/class/260/notes/
> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>
> or something else that can shed light on the issue.
>
> Thanks,
> Tejas Patil
>
> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <[email protected]>
> wrote:
>
> > Hi Tejas,
> > Thank you for the detailed information. Regarding the checks:
> >
> > Check 1  - can the url be fetched via wget command :
> >
> > ubuntu@ubuntu-OptiPlex-390:~$ wget
> > http://localhost:8080/nutch-test-site/child-1.html
> > --2012-12-18 16:07:34--
> > http://localhost:8080/nutch-test-site/child-1.html
> > Resolving localhost (localhost)... 127.0.0.1
> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
> > HTTP request sent, awaiting response... 200 OK
> > Length: 102 [text/html]
> > Saving to: `child-1.html.1'
> >
> > 100%[======================================>] 102         --.-K/s   in 0s
> >
> >
> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
> >
> > Check 2 : what are the robots rules defined for the host ? Do they allow
> > the crawler to crawl that url ? this will address #5.
> > Robot rules? I didn't get this check. Did you mean any setting in
> > nutch-site.xml?
> >
> > 3. After changing the parent page url from IP based to localhost and
> > running a *fresh* crawl, did you see any error or exception in the logs ?
> > try running a fresh crawl in local mode, it helps in debugging things
> > quickly.
> >
> > Did a fresh crawl. There are no errors, only warnings. The stats are the
> > same as above.
> > Configuration: regex-urlfilter.txt has "+." and urls/seed.txt has
> > http://localhost:8080/nutch-test-site/child-1.html
> >
> > Another important observation: when I set other sites for crawling, like
> > http://viterbi.usc.edu/admission/ etc., the crawl is successful and the
> > pages are indexed to solr. But when I crawl the above html page, nothing is
> > fetched. Also, when I try to crawl the site http://rajinimaski.blogspot.in/
> > (this has 3 blogs), there is a 403 status - failed to fetch.
> >
> >
> > thanks & Regards
> > Rajani
> >
> >
> >
> >
> >
> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]
> > >wrote:
> >
> > > Hi Rajani,
> > >
> > > A url is marked as "db_gone" when nutch receives one of the following
> > > HTTP error codes for the request:
> > > 1. Bad request (error code: 400)
> > > 2. Not found (error code: 404)
> > > 3. Access denied (error code: 401)
> > > 4. Permanently gone (error code: 410)
> > >
> > > Apart from this, a url can also be marked as "db_gone" if:
> > > 5. it's not getting crawled due to "Robots denied" or
> > > 6. some exception is triggered while fetching the content from the
> > > server (eg. Read time out, Broken socket etc.)
> > >
> > > (NOTE: as we are dealing with an HTTP url here, it made sense to focus
> > > on HTTP codes only. For the FTP protocol, nutch has similar handling,
> > > which I preferred not to discuss.)
> > >
> > > The reason why you could not see the child pages in the crawldb: the
> > > parent page has not been fetched successfully.
> > >
> > > Quick checks that you can try:
> > > 1. can the url be fetched via the wget command
> > > <http://linux.die.net/man/1/wget> on the terminal ? this will address
> > > cases 1-4
> > > 2. what are the robots rules defined for the host ? Do they allow the
> > > crawler to crawl that url ? this will address #5.
> > > 3. After changing the parent page url from IP based to localhost and
> > > running a *fresh* crawl, did you see any error or exception in the
> > > logs ? try running a fresh crawl in local mode, it helps in debugging
> > > things quickly.
> > >
> > > Thanks,
> > > Tejas Patil
> > >
> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]
> > > >wrote:
> > >
> > > >  Can you please tell me what this means: Status: 3 (db_gone)
> > >
> >
>
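
The db_gone conditions listed in the earliest quoted reply can be summarised
as a small lookup. This is only a sketch of the list as written in this
thread, not Nutch's actual code; the function and variable names are
hypothetical.

```python
# HTTP status codes that, per the list in this thread, lead to db_gone.
GONE_HTTP_CODES = {
    400: "Bad request",
    404: "Not found",
    401: "Access denied",
    410: "Permanently gone",
}


def is_db_gone(http_code=None, robots_denied=False, fetch_exception=False):
    """Return True if any of the six conditions from the thread applies."""
    if robots_denied or fetch_exception:  # conditions 5 and 6
        return True
    return http_code in GONE_HTTP_CODES   # conditions 1-4


print(is_db_gone(http_code=404))          # a listed code
print(is_db_gone(http_code=200))          # a successful fetch
print(is_db_gone(robots_denied=True))     # blocked by robots rules
```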
