Tianwei wrote
> 
> I never set topN; I just used the default value, and it works well for
> other crawls. I guess the default value is large enough. I remember
> it's the Integer maximum.
> 
I googled the default value of topN, and you are right.

Tianwei wrote
> 
> But it seems that the number of outlinks on those pages should be OK.
> 
I suggest setting the value of this property to -1.
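
The property is not named in this message. One Nutch setting that fits the description is db.max.outlinks.per.page (this is an assumption, not something stated above); as far as I know it defaults to 100, and a negative value means no limit. The sketch below is not Nutch code, just a rough illustration of how such a cap could drop links from a long index page. The function name and URLs are made up.

```python
def limit_outlinks(outlinks, max_per_page):
    """Keep at most max_per_page outlinks; a negative value keeps them all."""
    if max_per_page >= 0:
        return outlinks[:max_per_page]
    return list(outlinks)


# Hypothetical index page with 150 links (illustration only).
links = [f"http://example.com/list.aspx?id={i}" for i in range(150)]

print(len(limit_outlinks(links, 100)))  # capped: only the first 100 survive
print(len(limit_outlinks(links, -1)))   # -1: all 150 are kept
```

If the missing pages are only linked from the tail end of long listing pages, a cap like this would explain it.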

Tianwei wrote
> 
> One potential issue may be that some anchors are "img" rather than plain
> text; I don't know whether that causes problems in the parser. But in the
> case of " list.aspx?LastName=A A ", the anchor is "A", so the parser
> should be able to extract that link and crawl it.
> 
Some image anchors may be filtered out. The file regex-urlfilter.txt
contains a default negative filter:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

I think you can check it in your filter file.
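
A quick way to see what that rule rejects is to try the pattern on a few URLs. This is a minimal sketch in Python rather than Nutch itself; it assumes the filter applies the pattern with "find" semantics (a match anywhere in the URL, here anchored to the end by `$`), and the sample URLs are made up.

```python
import re

# Suffix pattern copied from the default rule quoted above.
# The leading "-" in regex-urlfilter.txt marks it as a reject rule.
SUFFIX_RULE = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|"
    r"ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)


def is_rejected(url: str) -> bool:
    """Return True if the suffix rule matches the URL (so it would be rejected)."""
    return SUFFIX_RULE.search(url) is not None


# Hypothetical URLs for illustration only.
samples = [
    "http://example.com/images/logo.png",
    "http://example.com/list.aspx?LastName=A",
    "http://example.com/photo.jpg?size=large",
]
for url in samples:
    print(url, "->", "rejected" if is_rejected(url) else "passes this rule")
```

The rule only looks at the end of the URL, so a link to list.aspx?LastName=A passes it even when the anchor is an image. Other rules in the same file may still reject such a URL.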


-----
I'm what I am.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/missing-pages-issue-tp3995893p3995916.html
Sent from the Nutch - User mailing list archive at Nabble.com.
