Hi all,

I have been using Nutch for a while and it works well. But sometimes it
misses pages, and the parse phase can't extract some outlinks. I checked the
HTML pages and enabled the debug log for Nutch, but I still can't find
any clue as to why I run into these problems.

For example, with the following seed URL I can't crawl any pages at all. I am
doing vertical search, so I set "db.ignore.external.links=true":
http://www.sidley.com/
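
For reference, this is the relevant part of my conf/nutch-site.xml (a minimal
sketch; everything else in the file is omitted):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Only follow outlinks on the same host as the seed (vertical crawl).</description>
</property>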

Besides, for the following seed URL:
http://www.aalrr.com

when parsing this page:
http://www.aalrr.com/attorneys/

I can't extract links such as <a href="list.aspx?LastName=A">A</a>.
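
If it helps, I believe the parse step can be checked in isolation with
something like this (assuming the parsechecker tool that ships with 1.x,
which should print the parse status and the extracted outlinks):

nutch-1.5/runtime/local$ ./bin/nutch parsechecker -dumpText http://www.aalrr.com/attorneys/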

Do you have any idea why the parser has this problem? Is it because
those HTML pages themselves are malformed, or is my configuration wrong? I
didn't add any filter to regex-urlfilter.txt; I just use:
+.
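
So the complete conf/regex-urlfilter.txt I'm using is just this (for
comparison, I believe the stock file also ships skip rules such as
"-[?*!@=]", which would drop URLs containing '?'):

# accept everything
+.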

I tried both 1.5 and 2.0; the result is the same. With 1.5 I run:
nutch-1.5/runtime/local$ ./bin/nutch crawl urls/ -dir db/ -depth 3
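
And I believe what ended up in the crawldb can be inspected with the readdb
tool (with the -dir layout above, the crawldb should be under db/crawldb):

nutch-1.5/runtime/local$ ./bin/nutch readdb db/crawldb -stats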

Hope you can give me some suggestions. Thanks a lot.

Tianwei
