Re: Crawling localhost Webapps - regex- urfilter query

Rajani Maski Mon, 17 Dec 2012 23:35:21 -0800

Hi Tejas,

  Please find my replies inline. Thank you for your reply and your time.


>
> *"status 1 (db_unfetched): 1"* means that url [1] is NOT crawled.
> (FYI: it is not interpreted as "db_unfetched - status is 1". The number 1
> here indicates that there is 1 url in the crawldb with status as
> db_unfetched.)
>
> You said that there are no exceptions in the log file. Which log file did
> you see ?

> If you are running in the distributed mode, then you must see the hadoop
> logs (on jobtracker) for the nutch jobs.
>


It is a basic local setup.

I downloaded the binary version of Apache Nutch 1.5.1 and followed the setup
steps mentioned in the wiki. The log file path is
../Downloads/apache-nutch-1.5.1/logs/hadoop.log
Note: I was using the system IP address yesterday, but today there is an
exception for the same URL:

2012-12-18 12:42:04,558 INFO  fetcher.Fetcher - fetch of
http://43.44.111.123:8080/nutch-test-site/home.html failed with:
java.net.SocketTimeoutException: Read timed out

So I changed it to localhost, and now the stats are:

ubuntu@ubuntu-OptiPlex-390:~/Downloads/apache-nutch-1.5.1$ bin/nutch readdb
crawlnewtest/crawldb -stats
CrawlDb statistics start: crawlnewtest/crawldb
Statistics for CrawlDb: crawlnewtest/crawldb
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 3 (db_gone): 1
CrawlDb statistics: done
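
As a quick illustration of how to read these status lines, here is a minimal Python sketch. It is not part of Nutch, and the function name parse_status_counts is made up for this example. It treats the number before the parenthesis as the status code and the number after the colon as how many URLs in the crawldb have that status.

```python
import re

# Sample text copied from the readdb -stats output above.
STATS_OUTPUT = """\
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 3 (db_gone): 1
"""

# Matches lines such as "status 3 (db_gone): 1".
STATUS_LINE = re.compile(r"^status\s+(\d+)\s+\((\w+)\):\s+(\d+)\s*$")


def parse_status_counts(text):
    """Return {status_name: (status_code, url_count)} for each status line.

    Hypothetical helper for illustration only; not a Nutch API.
    """
    counts = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            code, name, count = match.groups()
            counts[name] = (int(code), int(count))
    return counts


if __name__ == "__main__":
    for name, (code, count) in parse_status_counts(STATS_OUTPUT).items():
        print(f"{name}: status code {code}, {count} url(s)")
```

So "status 3 (db_gone): 1" reads as one URL whose status is db_gone, in the same way that "status 1 (db_unfetched): 1" meant one unfetched URL.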

>
> Also, can you send the entry of the url [1] from the crawldb ? The command
> is:
> *bin/nutch readdb <path to the crawldb> -url <url>*
>
Command : /Downloads/apache-nutch-1.5.1$ bin/nutch readdb
crawlnewtest/crawldb -url http://localhost:8080/nutch-test-site/home.html
URL: http://localhost:8080/nutch-test-site/home.html
Version: 7
Status: 3 (db_gone)
Fetch time: Fri Feb 01 12:26:54 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: _pst_: gone(11), lastModified=0:
http://localhost:8080/nutch-test-site/home.html
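
The numeric fields in this entry can be sanity-checked with a little arithmetic. The sketch below is plain Python, not a Nutch API. It assumes that "Fetch time" is the next scheduled fetch, so that subtracting the retry interval gives roughly when the last attempt was made; that is an assumption, not something confirmed in this thread. It also shows that the 1970 "Modified time" is simply epoch 0 (lastModified=0) printed in IST.

```python
from datetime import datetime, timedelta, timezone

# IST is UTC+05:30, matching the timestamps in the entry above.
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Values copied from the crawldb entry above.
RETRY_INTERVAL_SECONDS = 3888000
FETCH_TIME = datetime(2013, 2, 1, 12, 26, 54, tzinfo=IST)

# Convert the retry interval from seconds to days (86400 seconds per day).
retry_days = RETRY_INTERVAL_SECONDS / 86400

# Assumption: "Fetch time" is the next scheduled fetch, i.e. the last
# attempt plus the retry interval. Subtracting recovers the last attempt.
last_attempt = FETCH_TIME - timedelta(seconds=RETRY_INTERVAL_SECONDS)

# lastModified=0 is the Unix epoch; rendered in IST it becomes 05:30 on
# 1 January 1970, which is what "Modified time" shows.
modified = datetime.fromtimestamp(0, tz=IST)

if __name__ == "__main__":
    print("retry interval in days:", retry_days)
    print("estimated last attempt:", last_attempt.isoformat())
    print("modified time:", modified.isoformat())
```

Under that assumption, the last attempt works out to 18 Dec 2012, 12:26:54 IST, and the "Modified time" carries no real information because it was never set.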

Can you please tell me what "Status: 3 (db_gone)" means? Or could you point
me to a reference where I can find out what such statuses mean?

> If you are not able to get any output for above command, then get the dump
> of whole crawldb using this command:
> *bin/nutch readdb <path to the crawldb> -dump <output directory>*
>

/Downloads/apache-nutch-1.5.1$ bin/nutch readdb crawlnewtest/crawldb -dump
/home/ubuntu/Downloads/apache-nutch-1.5.1/test
CrawlDb dump: starting
CrawlDb db: crawlnewtest/crawldb
CrawlDb dump: done

Output is :
http://localhost:8080/nutch-test-site/home.html Version: 7
Status: 3 (db_gone)
Fetch time: Fri Feb 01 12:26:54 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: _pst_: gone(11), lastModified=0:
http://localhost:8080/nutch-test-site/home.html

In your previous email, under point B (the case where the db status is
fetched), you wrote:

> *B. The main url gets crawled successfully but the rest 2 child pages are
> not getting crawled.*
> If the url [1] is db_fetched, then use the same
> command "bin/nutch CrawlDbReader" to see the status of the rest 2 child
> pages.

Can you please correct me if the command I am using is wrong? It lists only
the one main URL.
Command : /Downloads/apache-nutch-1.5.1$ bin/nutch readdb
crawlnewtest/crawldb -stats

Thank you very much.

Regards
Rajani




> thanks,
> Tejas Patil
>
> On Mon, Dec 17, 2012 at 8:51 PM, Rajani Maski <[email protected]>
> wrote:
>
> > status 1 (db_unfetched): 1
>
