Re: Crawling localhost Webapps - regex- urfilter query

Tejas Patil Mon, 17 Dec 2012 00:44:17 -0800

Let's break down the possibilities:

*A. The main url [1] does NOT get crawled.*
This can happen due to a regex mismatch, or an exception while crawling
the url.
One naive way to take the regex rules out of the picture is to add "+." at
the start of the regex rules file. The first matching rule wins, so the
filter will then accept any url.
Now run a fresh crawl and see whether the url gets fetched. How to
check? Use the "bin/nutch readdb" command (CrawlDbReader).
If the status shown is db_unfetched, check the logs, which should show
the exception behind the failure.
If it was fetched, go to B.
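
To see why a leading "+." switches the filter off, here is a rough Python
sketch of first-match filtering. It is an illustration only, not Nutch's
code: Nutch uses Java regexes, and parse_rules, accepts and the sample
rule lines below are made up for this example.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose regex is found in the url decides.

    A url that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


SEED = "http://43.44.111.123:8080/nutch-test-site/ch-1.html"

# Hypothetical rules file: reject urls with query-like characters,
# then accept everything on one host and port.
original = parse_rules("""
# skip urls containing certain characters
-[?*!@=]
+^http://43.44.111.123:8080/
""")

# The same rules with the catch-all "+." placed first.
debug = parse_rules("+.\n-[?*!@=]\n+^http://43.44.111.123:8080/")

print(accepts(original, SEED))
print(accepts(original, SEED + "?page=2"))
print(accepts(debug, SEED + "?page=2"))
```

With "+." on top no later rule is ever reached, so if the url is still
not fetched, the regex file is not the cause.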

*B. The main url gets crawled successfully but the 2 child pages are
not getting crawled.*
If the url [1] is db_fetched, use the same "bin/nutch readdb" command
(CrawlDbReader) to see the status of the 2 child pages.
If they are both db_unfetched, then some exception caused the issue.
See the logs for details.

If the child pages are not found in the DB, then there was some issue with
link extraction, i.e. getting the child links from the content of the main
page.
Read the segment for the first round of the crawl (which will have the
content of the main page). Extract the content of the main page using the
"bin/nutch readseg" command (SegmentReader). Check whether the fetched
content has the child urls in it. If yes, then the issue is with link
extraction.
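
One way to do the "does the content have the child urls" check is to
paste the dumped HTML into a small script. The Python sketch below is
only an illustration: LinkCollector, extract_links and the SAMPLE markup
are made up (the real page content is not shown in this thread), and
Nutch's own HTML parser may behave differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute urls.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


BASE = "http://43.44.111.123:8080/nutch-test-site/ch-1.html"

# Hypothetical page content standing in for the dumped segment content.
SAMPLE = '<html><body><a href="ch1/ch1-1.html">child</a></body></html>'

links = extract_links(SAMPLE, BASE)
print(links)
```

If the child url shows up here but never reaches the crawl DB, the
problem lies somewhere between parsing and the DB update, not in
fetching.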

Please do this and report back to this group with your observations.

Thanks,
Tejas Patil

[1] : http://43.44.111.123:8080/nutch-test-site/ch-1.html

On Sun, Dec 16, 2012 at 9:48 PM, Rajani Maski <[email protected]> wrote:

> Hi users,
>
>    I am trying to crawl the web applications running on the local apache
> tomcat webserver. Note : tomcat version 7, running on 8080 port.
>
>
> The Main html page is :
> http://43.44.111.123:8080/nutch-test-site/ch-1.html.
> This main page has a hyperlink to its child page -
> http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html
> and that child page in turn has a hyperlink to its own child -
> http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html
>
>
> Now *I would like to know which filter has to be given in
> regex-urlfilter.txt to accept crawling for this site*,
> because the log says "No more URLs to fetch". This seems to be a mistake
> in my regex-urlfilter.txt or seed.txt.
>
> I tried the following setups:
>
> *Case 1*
> regex-urlfilter.txt  -
>    # accept anything else
>    +^http://43.44.111.123:8080/nutch-test-site/child-1.html
>
> seed.txt -
>   http://43.44.111.123:8080/nutch-test-site/child-1.html
>
>
> *Case 2*
> regex-urlfilter.txt  -
>    # accept anything else
>    +^http://43.44.111.123:8080/
>
> seed.txt -
>   http://43.44.111.123:8080/nutch-test-site/child-1.html
>
>
> *Case 3*
> regex-urlfilter.txt  -
>    # accept anything else
>    +^http://43.44.111.123:8080/
>
> seed.txt -
>   http://43.44.111.123:8080/
>
>
> Output : Stopping at depth=1 - no more URLs to fetch.
>
>
> *Nutch command:*
> bin/nutch crawl urls -dir tomcatcrawl -solr http://localhost:8080/solrnutch -depth 3 -topN 5
>
> Can you please point out the mistake here?
>
> Regards
> Rajani.
>
