Hi Jose,

We get this question very often and the short answer, as far as the 'stats'
printout is concerned, is that everything is probably fine. For a more
complete answer, please search the mailing-list archives or Google.
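
By the way, a large db_unfetched count is expected with depth 2: the URLs
discovered during the second fetch round are added to the crawldb but never
get scheduled, so they stay unfetched. Running more rounds (a larger -depth,
with a topN big enough) will pick them up. A rough sketch, assuming the stock
1.4 crawl command and made-up seed/output directory names:

  bin/nutch crawl urls -dir crawl-mx -depth 5 -topN 5000

Each extra round generates a new segment from the unfetched URLs, fetches it,
and updates the crawldb.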

BTW, how did you change the heap size? I get an IOException when the topN is
set 'too' high.
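
(On my side I just set NUTCH_HEAPSIZE before calling the script; if I read
the stock bin/nutch script correctly, it turns that value, in MB, into the
-Xmx option, e.g.

  export NUTCH_HEAPSIZE=4000
  bin/nutch crawl urls -dir crawl-mx -depth 5 -topN 5000

Editing JAVA_HEAP_MAX directly in bin/nutch should work too.)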

Remi

On Wednesday, February 29, 2012, pepe3059 <pepe3...@gmail.com> wrote:
> Hello, I'm Jose. I have a question and I hope you can help me.
>
> I have nutch-1.4 and I'm crawling the web of one country (mx); for that
> reason I changed regex-urlfilter.txt to add the appropriate regex. The
> second thing I changed, in the nutch script, was the Java heap size,
> because of an out-of-memory error. My question is: I am crawling two seed
> sites with depth 2, but I get very few pages fetched. The readdb output is
> below:
> TOTAL urls:     653
> retry 0:        653
> min score:      0.0
> avg score:      0.0077212863
> max score:      1.028
> status 1 (db_unfetched):        504
> status 2 (db_fetched):  139
> status 3 (db_gone):     4
> status 4 (db_redir_temp):       4
> status 5 (db_redir_perm):       2
> CrawlDb statistics: done
>
> In some other posts I saw that people changed "protocol-httpclient" to
> "protocol-http" in nutch-site.xml, but it is the same with both protocols.
> I did a -dump of the crawldb and manually checked some of the db_unfetched
> URLs to see if they were unavailable, but they are reachable and have
> content, and no robots.txt is present on the servers. What must I do to get
> more URLs fetched?
>
>
> Sorry for my English, thank you.
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/too-few-db-fetched-tp3785938p3785938.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
