Re: crawling forum pages

Julien Nioche Tue, 09 Oct 2012 13:45:06 -0700

"Depth" is a misleading term; "round" would be more accurate. Why don't you
write an HTMLParser to extract the total number of pages and generate
outlinks to every page beyond the first one, i.e. the whole range from 2
to 30? That assumes, of course, that the total number of pages is expressed
in a consistent way.
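
A minimal sketch of that idea, in plain Python rather than against the
Nutch plugin API. It assumes the pager prints something like "Page 1 of 30"
and that pages are addressed with a "page" query parameter; both are
hypothetical and need adjusting to the actual forum. The function names are
illustrative only.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Assumed pager text, e.g. "Page 1 of 30". Hypothetical: adapt to the forum.
PAGE_COUNT_RE = re.compile(r"Page\s+\d+\s+of\s+(\d+)", re.IGNORECASE)


def total_pages(html):
    """Return the page count announced by the pager, or 1 if none is found."""
    match = PAGE_COUNT_RE.search(html)
    return int(match.group(1)) if match else 1


def page_url(first_page_url, page):
    """Build the URL of a given page, assuming a 'page' query parameter."""
    parts = urlparse(first_page_url)
    # Drop any existing 'page' parameter before appending the new one.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    query.append(("page", str(page)))
    return urlunparse(parts._replace(query=urlencode(query)))


def pagination_outlinks(first_page_url, html):
    """Generate outlinks for pages 2..N of a thread from its first page."""
    last = total_pages(html)
    return [page_url(first_page_url, p) for p in range(2, last + 1)]
```

Called on page 1 of a 30-page thread, pagination_outlinks() returns 29 URLs,
so every page becomes reachable in the next round instead of being discovered
a few pages at a time.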


HTH

Julien

On 9 October 2012 10:15, Jiang Fung Wong <[email protected]> wrote:

> Hi All,
>
> I am setting up Nutch to crawl forum pages and index the posts in the
> content pages (threads). I face a problem: Nutch cannot discover
> all content pages, even though I set a very high depth.
>
> This is because a thread typically has many posts that span
> several pages. Suppose I am at page 1 of 30. It only contains links to
> page 2, page 3, up to page 10, and the last page.
>
> "[1,2,3,4....10] Next Last"
>
> I have to go to page 2 to discover page 11, and so on. So to discover
> all 30 pages, Nutch has to explore pages 1 to 20, which is not possible
> with a typical depth.
>
> What should I do in this case?
>
>
> Regards,
> Jiang
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
