Re: Crawling localhost Webapps - regex- urfilter query

Rajani Maski Mon, 17 Dec 2012 20:52:18 -0800

Hi Tejas,

 Thank you very much for the detailed reply.

Please find my observations inline below:

*A. The main url [1] does NOT get crawled.*
This can happen due to a regex mismatch, or an exception while crawling
the url.
One naive way to rule out the regex rules is to simply add "+." at the
start of the regex rules file. It will then accept any url.
Done
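
For reference, the effect of the rules file can be approximated in a few lines of shell. This is only a rough sketch, not Nutch code. It assumes the rules are tried from top to bottom, that the first matching rule decides, and that a url matching no rule is dropped. It also uses grep -E (POSIX extended regex) in place of Java regex, so advanced patterns may behave differently. The function name url_filter is made up for illustration.

```shell
#!/bin/sh
# Rough emulation of a "+regex" / "-regex" rules file (hypothetical helper,
# not part of Nutch). Prints "accept" or "reject" for one url.
url_filter() {
  rules_file=$1
  url=$2
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '+'*|'-'*) ;;          # a rule line
      *) continue ;;         # blank lines, comments, anything else
    esac
    sign=${line%"${line#?}"} # first character: + or -
    regex=${line#?}          # rest of the line: the pattern
    # Partial match anywhere in the url is enough.
    if printf '%s\n' "$url" | grep -Eq -e "$regex"; then
      if [ "$sign" = "+" ]; then echo accept; else echo reject; fi
      return 0
    fi
  done < "$rules_file"
  echo reject                # no rule matched
}

# Demo with the naive catch-all rule from above.
rules=$(mktemp)
printf '%s\n' '# accept anything' '+.' > "$rules"
url_filter "$rules" 'http://43.44.111.123:8080/nutch-test-site/ch-1.html'
rm -f "$rules"
```

With "+." as the first rule every non-empty url is accepted, so if the url is still not fetched the filter is probably not the cause.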
Now run a fresh crawl and see whether the url gets fetched. How to
check? Use the "bin/nutch readdb" command (the CrawlDbReader class).

Command used is :  bin/nutch readdb crawlnewtest -stats
CrawlDb statistics start: crawlnewtest
Statistics for CrawlDb: crawlnewtest
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done

The status is 1 (db_unfetched).
The log has only INFO and WARN entries, no errors:
WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
WARN  mapred.JobClient - Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
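
To compare several runs, the status lines can be pulled out of the -stats output with a small filter. This is a minimal sketch that assumes the output looks like the lines pasted above; the function name crawldb_status_counts is made up for illustration. Note that db_unfetched means the url is known to the CrawlDb but has not been fetched yet.

```shell
#!/bin/sh
# Read `readdb -stats` output on stdin and print "<status_name> <count>"
# for every "status N (name): count" line (hypothetical helper).
crawldb_status_counts() {
  sed -n 's/.*status [0-9][0-9]* (\([a-z_]*\)):[[:space:]]*\([0-9][0-9]*\).*/\1 \2/p'
}

# Demo on the statistics shown above.
crawldb_status_counts <<'EOF'
TOTAL urls: 1
retry 0: 1
min score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
EOF
```

Assuming the statistics are printed to the console as they were here, the same filter could be fed directly, for example: bin/nutch readdb crawlnewtest -stats 2>&1 | crawldb_status_counts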

If the status shown is db_unfetched, then check the logs for exceptions
that could be responsible for the failure.
If it was fetched, then go to B.

Does the above status mean that url [1] is crawled? If that is the case, why
is it not indexed into Solr? The nutch command I have used is: bin/nutch
crawl urls -dir crawlnewtest -solr http://localhost:8080/solrnutch -depth 3
-topN 5

I do not see any errors in the logs other than the warnings that I have
mentioned above. I have yet to follow the last few steps of reading the
segments.

*B. The main url gets crawled successfully but the other 2 child pages are
not getting crawled.*
If the url [1] is db_fetched, then use the same "bin/nutch readdb"
command to see the status of the other 2 child pages.

If they are both db_unfetched, then there was some exception which caused
the issue. See logs for details.

If the child pages are not found in the DB, then there was some issue with
link extraction, i.e. getting the child links from the content of the main
page.
Read the segment for the first round of the crawl (which will have the
content of the main page). Extract the content of the main page using the
"bin/nutch readseg" command (the SegmentReader class). Check whether the
fetched content has the child urls in it. If yes, then the issue is with
link extraction.
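
The check on the segment content could be sketched as below. The readseg options in the comment are from memory of the Nutch 1.x usage text and should be verified by running "bin/nutch readseg" without arguments; the segment directory name and the output directory segdump are placeholders. The helper list_hrefs is made up for illustration and only handles lowercase, double-quoted href attributes.

```shell
#!/bin/sh
# Assumed way to dump only the raw content of one segment (verify the
# options against your Nutch version before relying on them):
#   bin/nutch readseg -dump crawlnewtest/segments/<segment> segdump \
#       -nofetch -nogenerate -noparse -noparsedata -noparsetext
# The text dump is then expected in segdump/dump.

# Print the distinct href targets found in a text file (hypothetical helper).
list_hrefs() {
  grep -o 'href="[^"]*"' "$1" | sed 's/^href="//; s/"$//' | sort -u
}

# Demo on a small stand-in for the dumped page content.
sample=$(mktemp)
printf '%s\n' '<a href="ch1/ch1-1.html">child page</a>' > "$sample"
list_hrefs "$sample"
rm -f "$sample"
```

If the child links show up in this list but never reach the CrawlDb, that points to link extraction or filtering of the outlinks rather than to fetching.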

On Mon, Dec 17, 2012 at 2:13 PM, Tejas Patil <[email protected]> wrote:

> Let's break down the possibilities:
>
> *A. The main url [1] does NOT get crawled.*
> This can happen due to a regex mismatch, or an exception while crawling
> the url.
> One naive way to rule out the regex rules is to simply add "+." at the
> start of the regex rules file. It will then accept any url.
> Now run a fresh crawl and see whether the url gets fetched. How to
> check? Use the "bin/nutch readdb" command (the CrawlDbReader class).
> If the status shown is db_unfetched, then check the logs for exceptions
> that could be responsible for the failure.
> If it was fetched, then go to B.
>
> *B. The main url gets crawled successfully but the other 2 child pages are
> not getting crawled.*
> If the url [1] is db_fetched, then use the same "bin/nutch readdb"
> command to see the status of the other 2 child pages.
> If they are both db_unfetched, then there was some exception which caused
> the issue. See logs for details.
>
> If the child pages are not found in the DB, then there was some issue with
> link extraction, i.e. getting the child links from the content of the main
> page.
> Read the segment for the first round of the crawl (which will have the
> content of the main page). Extract the content of the main page using the
> "bin/nutch readseg" command (the SegmentReader class). Check whether the
> fetched content has the child urls in it. If yes, then the issue is with
> link extraction.
>
> Please do this and report back to this group with your observations.
>
> Thanks,
> Tejas Patil
>
>
> [1] : http://43.44.111.123:8080/nutch-test-site/ch-1.html
>
>
> On Sun, Dec 16, 2012 at 9:48 PM, Rajani Maski <[email protected]>
> wrote:
>
> > Hi users,
> >
> >    I am trying to crawl the web applications running on the local apache
> > tomcat webserver. Note : tomcat version 7, running on 8080 port.
> >
> >
> > The main html page is:
> > http://43.44.111.123:8080/nutch-test-site/ch-1.html
> > This main page has a hyperlink to its child page:
> > http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html
> > and that child page in turn has a hyperlink to its own child:
> > http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html
> >
> >
> > Now *I would like to know which filter has to be given in
> > regex-urlfilter.txt to allow crawling of this site*.
> > I am getting the log message "No more URLs to fetch", which suggests a
> > mistake in my regex-urlfilter.txt or seed.txt.
> >
> > I tried with the following cases setup:
> >
> > *Case 1*
> > regex-urlfilter.txt  -
> >    # accept anything else
> >    +^http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> > seed.txt -
> >   http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> >
> > *Case 2*
> > regex-urlfilter.txt  -
> >    # accept anything else
> >    +^http://43.44.111.123:8080/
> >
> > seed.txt -
> >   http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> >
> > *Case 3*
> > regex-urlfilter.txt  -
> >    # accept anything else
> >    +^http://43.44.111.123:8080/
> >
> > seed.txt -
> >   http://43.44.111.123:8080/
> >
> >
> > Output : Stopping at depth=1 - no more URLs to fetch.
> >
> >
> > *Nutch command:*
> > bin/nutch crawl urls -dir tomcatcrawl -solr
> > http://localhost:8080/solrnutch -depth 3 -topN 5
> >
> > Can you please point out the mistake here?
> >
> > Regards
> > Rajani.
> >
>
