Hi all,

I have used Nutch for a while and it works well. But sometimes it misses pages, and the parse phase fails to extract some outlinks. I checked the HTML pages and also enabled Nutch's debug log, but I still can't find any clue as to why this happens.

For example, with the following seed URL I can't crawl any pages at all. I am doing vertical search, so I set db.ignore.external.links=true:

http://www.sidley.com/
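In case I configured it in the wrong place, this is the relevant snippet from my conf/nutch-site.xml (only this one property shown; it sits inside the usual <configuration> element):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Ignore outlinks that point to external hosts, so the
    crawl stays on the injected seed hosts.</description>
  </property>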
Besides, for the seed URL http://www.aalrr.com, when parsing the page http://www.aalrr.com/attorneys/ I can't extract links such as:

<a href="list.aspx?LastName=A">A</a>

Do you guys have any idea why the parser has problems here? Is it because those HTML pages themselves are broken, or is one of my settings wrong? I didn't add any filters to conf/regex-urlfilter.txt, just the catch-all accept rule:

+.

I tried both 1.5 and 2.0; the result is the same. This is how I start the crawl:

nutch-1.5/runtime/local$ ./bin/nutch crawl urls/ -dir db/ -depth 3

Hope you can give me some suggestions, thanks a lot.

Tianwei
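P.S. To reproduce the parsing problem without running a full crawl, I believe the page can also be parsed standalone with the parser checker tool that ships with Nutch 1.x (assuming I have the invocation right; it fetches the URL, runs the configured parser, and prints the extracted outlinks):

nutch-1.5/runtime/local$ ./bin/nutch parsechecker http://www.aalrr.com/attorneys/

If the list.aspx?LastName=A links don't show up in that output either, the parser itself is probably the place to look rather than the crawl setup.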

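P.P.S. In case it matters: the urls/ directory in the crawl command above just contains a single plain-text seed file with one URL per line, e.g.:

http://www.sidley.com/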
