Hi JC,

I think Marcus already answered the politeness question :) But without any delay it will be worse :)

Do these missing URLs match one of the filtering regexes?
Take a look at .../conf/regex-urlfilter.txt; I once had a problem with this regex:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
It will just silently drop all URLs with GET parameters.
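A quick way to check which URLs that rule catches (a sketch in Python; the sample URLs are invented, purely to illustrate the effect):

```python
import re

# The default rule "-[?*!@=]" in regex-urlfilter.txt excludes any URL
# containing one of these characters. Inside a character class they are
# all literal, so a plain search reproduces the filter's behavior.
skip = re.compile(r"[?*!@=]")

for url in [
    "http://example.com/page",        # no listed character -> kept
    "http://example.com/page?id=42",  # contains '?' and '=' -> dropped
]:
    print(url, "-> dropped" if skip.search(url) else "-> kept")
```

If your seeds rely on GET parameters, relaxing or removing that line in regex-urlfilter.txt is the usual fix.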

--Roland


On 01.03.2013 15:08, jc wrote:
Hi Roland and lufeng,

Thank you very much for your replies. I already tested lufeng's advice, with
results pretty much as expected.

By the way, my Nutch installation is based on version 2.1 with HBase as
crawldb storage.

Roland, maybe the fetcher.server.delay param has something to do with that as
well; I set it to 3 seconds. Would setting it to 0 be impolite?
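For reference, that property is usually set in conf/nutch-site.xml; a sketch (the 3-second value mirrors the setting mentioned above, and the description paraphrases its intent):

```
<property>
  <name>fetcher.server.delay</name>
  <value>3.0</value>
  <description>Seconds the fetcher waits between successive requests
  to the same server (the politeness delay under discussion).</description>
</property>
```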

All the info you provided has helped me a lot; only one issue remains unfixed.
There are more than 60 URLs from different hosts in my seed file, but only
20 queues. It may look as if the other 40 hosts have no more URLs to
generate, but I really haven't seen any URL coming from those hosts since
the creation of the crawldb.

Based on my limited experience, the following params should allow 60
queues for my vertical crawl; am I missing something?

topN = 1 million
fetcher.threads.per.queue = 3
fetcher.threads.per.host = 3 (just in case, I remember you told me to use
per.queue instead)
fetcher.threads.fetch = 200
seed URLs of different hosts = 60 or more (regex-urlfilter.txt allows only
URLs from these hosts, they're all there, I checked)
crawldb record count > 1 million

Thanks again for all your help

Regards,
JC
