Depth is a misleading term; "round" would be a better name for it. Why don't you write an HTMLParser to extract the total number of pages and generate outlinks to all the pages beyond the first one, i.e. the whole range from 2 to 30? That assumes, of course, that the total number of pages is expressed in a consistent way.
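As a rough illustration of that idea, here is a minimal sketch of the logic only, written in Python for brevity (in Nutch itself it would live in a Java parser plugin). It assumes the first page contains text such as "Page 1 of 30" and that pages are addressed with a `page` query parameter. Both are guesses about the forum, and the names `TOTAL_PAGES_RE`, `total_pages` and `page_outlinks` are made up for this example.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the pager prints something like "Page 1 of 30".
# Adjust the pattern to whatever the forum actually emits.
TOTAL_PAGES_RE = re.compile(r"Page\s+\d+\s+of\s+(\d+)", re.IGNORECASE)


def total_pages(html):
    """Return the total page count found in the HTML, or None if absent."""
    match = TOTAL_PAGES_RE.search(html)
    return int(match.group(1)) if match else None


def page_outlinks(thread_url, html, page_param="page"):
    """Build the URLs of pages 2..N of a thread from its first page."""
    total = total_pages(html)
    if total is None or total < 2:
        return []  # single-page thread, or the pager text was not found

    parts = urlsplit(thread_url)
    # Drop any existing page parameter so it is not duplicated.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != page_param]

    links = []
    for n in range(2, total + 1):
        new_query = urlencode(query + [(page_param, str(n))])
        links.append(urlunsplit(
            (parts.scheme, parts.netloc, parts.path, new_query, parts.fragment)))
    return links


if __name__ == "__main__":
    html = '<div class="pager">Page 1 of 30 [1,2,3,4....10] Next Last</div>'
    links = page_outlinks("http://forum.example.com/thread?id=42", html)
    print(len(links))  # 29
    print(links[0])    # http://forum.example.com/thread?id=42&page=2
```

With all of pages 2 to 30 emitted as outlinks of page 1, the whole thread becomes reachable in a single additional round instead of depending on the pager's own links.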

HTH

Julien

On 9 October 2012 10:15, Jiang Fung Wong <[email protected]> wrote:

> Hi All,
>
> I am setting up nutch to crawl forum pages and index the posts in the
> content pages (threads). I face a problem: nutch could not discover
> all content pages, despite me setting a very high depth.
>
> This is because, typically a thread could have many posts that span
> several pages. Suppose I am at page 1 of 30. It only contains links to
> page2, page3, up to page10, and the last page.
>
> "[1,2,3,4....10] Next Last"
>
> I have to go to page 2 to discover page 11, and so on. So to discover
> all 30 pages, nutch has to explore pages 1~20, which is not possible
> with a typical depth.
>
> What should I do in this case?
>
>
> Regards,
> Jiang
>

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

