Tianwei wrote
> I never set topN; I just used the default value, and it works well for
> other crawls. I guess the default value is large enough. I remember
> it's Integer.Maximum.

I googled the default value of topN, and you are right.

Tianwei wrote
> But it seems that the number of outlinks on those pages should be OK.

I suggest you set the value of this property to -1.

Tianwei wrote
> One potential issue may be that some anchors are "img", not plain text; I
> don't know if that will cause problems in the parser. But for the case of
> "list.aspx?LastName=A A", the anchor is "A", so it should be able to
> extract that link and crawl it.

Some image anchors may be filtered. The file regex-urlfilter.txt contains a default negative filter:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

I think you can check it in your filter file.

-----
I'm what I am.
--
View this message in context: http://lucene.472066.n3.nabble.com/missing-pages-issue-tp3995893p3995916.html
Sent from the Nutch - User mailing list archive at Nabble.com.
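
P.S. If it helps, here is a rough way to see which URLs the default negative filter quoted above would drop. It is only a sketch, not Nutch code: the class name UrlFilterCheck, the method isRejected, and the example.com URLs are made up for illustration, and it assumes the filter applies the pattern as a partial match (find), with the leading "-" in the file marking the rule as an exclusion rather than being part of the regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: tests URLs against the regex
// from the default negative rule in regex-urlfilter.txt.
public class UrlFilterCheck {

    // Regex portion of the rule; the leading '-' in the file is the
    // "exclude" sign and is assumed not to belong to the pattern itself.
    private static final Pattern SKIP_SUFFIXES = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|"
        + "ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$");

    // Returns true when the URL ends with one of the listed suffixes,
    // i.e. when this exclusion rule would apply to it.
    static boolean isRejected(String url) {
        return SKIP_SUFFIXES.matcher(url).find();
    }

    public static void main(String[] args) {
        // Sample URLs are placeholders, not taken from the real site.
        String[] samples = {
            "http://example.com/images/logo.gif",
            "http://example.com/list.aspx?LastName=A"
        };
        for (String url : samples) {
            System.out.println((isRejected(url) ? "rejected: " : "kept:     ") + url);
        }
    }
}
```

With this pattern, a URL ending in .gif is rejected, while list.aspx?LastName=A is kept, so that link should not be lost because of this particular rule.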

