Hi, thanks for your suggestions.

On Wed, Jul 18, 2012 at 10:08 PM, IT_ailen <[email protected]> wrote:

> Hello, in my opinion, there are several causes for missing pages.
> 1. the topN option you set may be too low;

I have never set topN; I just use the default value, and it works well for my other crawls. I assume the default is large enough. As far as I remember, it is the Integer maximum.

> 2. you are banned by the destine server because of frequently requiring;

I am pretty sure those sites will not ban me.

> 3. the property "db.max.outlinks.per.page" limits the size of fetching
> queue

The number of outlinks on those pages looks OK to me, though. One potential issue is that some anchors are "img" elements rather than plain text; I don't know whether that causes problems in the parser. But in the case of "<a href="list.aspx?LastName=A">A</a>", the anchor is "A", so the parser should be able to extract that link and crawl it.

Tianwei

> I hope the tips can help you ~~
>
>
> -----
> I'm what I am.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/missing-pages-issue-tp3995893p3995900.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
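
On the anchor question, here is a quick sanity check that can be run outside Nutch. It is only a minimal sketch using Python's standard html.parser, not Nutch's own parser, so it says nothing definitive about how Nutch behaves. The second sample link (LastName=B with an image anchor) and the names LinkCollector and extract_links are made up for illustration. It shows that the href is an attribute of the <a> tag, so a link extractor can in principle pick it up whether the anchor is text or an image.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whatever the anchor contains."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The href lives on the <a> start tag itself, so the anchor
        # content (plain text or a nested <img>) is irrelevant here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all <a href> values found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# First link is the one from the page in question; the second is a
# hypothetical image-anchored link added for comparison.
sample = (
    '<a href="list.aspx?LastName=A">A</a>'
    '<a href="list.aspx?LastName=B"><img src="b.png" alt="B"></a>'
)
print(extract_links(sample))
```

If both links show up here but the image-anchored ones are still missing from the crawl, that would point at the parser or URL filters in the crawl configuration rather than at the page markup.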

