Re: too few db_fetched

2012-02-29 Thread Markus Jelsma

Short answer: continue crawling!


When crawling a large number of records I wouldn't encourage you 
to use the crawl command. It's better to build a small shell script that 
repeats the crawl cycle over and over.


Remember, a depth of 2 is nothing more than the crawl cycle 
executed twice! You'll never get far with two cycles.
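The repeated cycle described above can be sketched as a small shell script. This is only a sketch: the paths (`crawl/crawldb`, `crawl/segments`, `urls/`), the `-topN` value, and the number of iterations are assumptions you would adapt to your own setup.

```shell
#!/bin/sh
# Sketch of a repeated Nutch 1.x crawl cycle. Assumes a seed list in
# urls/ and that this is run from the Nutch installation directory.

NUTCH=bin/nutch
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

# Inject the seed URLs once.
$NUTCH inject $CRAWLDB urls

# Repeat the generate/fetch/parse/update cycle; each pass is one "depth".
for i in 1 2 3 4 5 6 7 8 9 10; do
  $NUTCH generate $CRAWLDB $SEGMENTS -topN 1000
  # The newest directory under crawl/segments is the segment just generated.
  SEGMENT=`ls -d $SEGMENTS/* | tail -1`
  $NUTCH fetch $SEGMENT
  $NUTCH parse $SEGMENT
  $NUTCH updatedb $CRAWLDB $SEGMENT
done
```

Each loop iteration plays the role of one "depth" of the crawl command, so ten iterations go much deeper than `-depth 2`.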


On Wed, 29 Feb 2012 05:12:08 +0200, remi tassing 
tassingr...@gmail.com wrote:

Hi Jose,

We have this question very often and the short answer, with regard to 
the 'stats' printout, is that everything is probably fine. For a more 
complete answer, please search the mailing list or Google.

BTW, how did you change the heap size? I get an IOException when the 
topN is 'too' high.

Remi

On Wednesday, February 29, 2012, pepe3059 pepe3...@gmail.com wrote:

Hello, I'm Jose. I have one question and I hope you can help me.

I have Nutch 1.4 and I'm crawling the web of one country (mx), so I 
changed regex-urlfilter to add the correct regex. The second parameter 
changed in the nutch script was the Java heap size, because of an 
out-of-memory error. My question is: I am crawling two sites (seeds) 
with depth 2, but I get very few sites fetched. The result of readdb 
is below:
TOTAL urls: 653
retry 0:653
min score:  0.0
avg score:  0.0077212863
max score:  1.028
status 1 (db_unfetched):504
status 2 (db_fetched):  139
status 3 (db_gone): 4
status 4 (db_redir_temp):   4
status 5 (db_redir_perm):   2
CrawlDb statistics: done

In some other posts I saw they changed protocol-httpclient for 
protocol-http in nutch-site.xml, but it's the same with both protocols. 
I did a -dump from the crawldb and manually verified some db_unfetched 
URLs to see if they are unavailable, but they are correct and have 
content, and no robots.txt is present on the servers. What must I do 
to get more URLs fetched?


sorry for my english, thank you


--
View this message in context:
http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3785938.html
Sent from the Nutch - User mailing list archive at Nabble.com.






Re: too few db_fetched

2012-02-29 Thread pepe3059
Thank you for your answers. remi tassing, you can increase the Java heap used
by Nutch by modifying the variable JAVA_HEAP_MAX=-Xmx1000m in the
script bin/nutch; 1 GB is currently assigned.
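For reference, the change amounts to editing one line near the top of bin/nutch; the 4 GB figure below is only an example value, not a recommendation:

```shell
# In bin/nutch (Nutch 1.4), the maximum Java heap is set by this variable:
JAVA_HEAP_MAX=-Xmx1000m
# To raise it, e.g. to 4 GB (pick a value that fits your machine's RAM):
JAVA_HEAP_MAX=-Xmx4000m
```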



Another question about my problem: I know MapReduce is used by default. I read
in one post that map and reduce tasks can interfere with the fetch process;
is that correct? Where can I find information on the status codes and
other values dumped by readdb? I got the following values for one URL:

http://cca.inegi.org.mx/en-contacto/foro-del-cca
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Feb 28 17:11:55 CST 2012
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.030734694
Signature: null
Metadata: 

thank you


--
View this message in context: 
http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3788086.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: too few db_fetched

2012-02-28 Thread remi tassing
Hi Jose,

We have this question very often and the short answer, with regard to
the 'stats' printout, is that everything is probably fine. For a more complete
answer, please search the mailing list or Google.

BTW, how did you change the heap size? I get an IOException when the topN
is 'too' high.

Remi

On Wednesday, February 29, 2012, pepe3059 pepe3...@gmail.com wrote:
 Hello, I'm Jose. I have one question and I hope you can help me.

 I have Nutch 1.4 and I'm crawling the web of one country (mx), so I
 changed regex-urlfilter to add the correct regex. The second parameter
 changed in the nutch script was the Java heap size, because of an
 out-of-memory error. My question is: I am crawling two sites (seeds)
 with depth 2, but I get very few sites fetched. The result of readdb
 is below:
 TOTAL urls: 653
 retry 0:653
 min score:  0.0
 avg score:  0.0077212863
 max score:  1.028
 status 1 (db_unfetched):504
 status 2 (db_fetched):  139
 status 3 (db_gone): 4
 status 4 (db_redir_temp):   4
 status 5 (db_redir_perm):   2
 CrawlDb statistics: done

 In some other posts I saw they changed protocol-httpclient for
 protocol-http in nutch-site.xml, but it's the same with both protocols.
 I did a -dump from the crawldb and manually verified some db_unfetched
 URLs to see if they are unavailable, but they are correct and have
 content, and no robots.txt is present on the servers. What must I do
 to get more URLs fetched?


 sorry for my english, thank you


 --
 View this message in context:
http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3785938.html
 Sent from the Nutch - User mailing list archive at Nabble.com.