I faced a similar problem while crawling an online shopping website to
gather the catalog of all available products. There were many products in
a given category and it was messy to follow all the "next" links.

Analyze the pattern of the "next" links and define tighter regexes so that
the unwanted stuff doesn't get crawled. If the crawlspace is small or the
depth is shallow, this should be sufficient.
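
For example, with a made-up forum URL scheme, the entries in
conf/regex-urlfilter.txt could look roughly like this (note that the stock
file also has a rule skipping URLs with query characters, which would need
relaxing for ?t=...-style links):

  # keep thread pages, with or without a page parameter, skip everything else
  +^https?://forum\.example\.com/showthread\.php\?t=\d+(&page=\d+)?$
  -.
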
If the crawlspace is huge, it's worth spending some time modifying the
parsing logic so that you extract the total number of pages and then
generate the URLs for all the pages from that count.
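
Along the lines of Julien's suggestion below, here is a rough sketch of that
second approach in plain Java. The "Page 1 of 30" pattern, the "&page=N" URL
scheme and the class name are just placeholders for whatever the forum
actually uses, and the wiring that turns the generated URLs into Nutch
outlinks is not shown:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PaginationExpander {

    // Assumed pagination text; many forums render something like "Page 1 of 30".
    private static final Pattern TOTAL_PAGES =
        Pattern.compile("Page\\s+\\d+\\s+of\\s+(\\d+)");

    // Given a thread URL and the raw HTML of its first page, return URLs for
    // pages 2..N so the crawler need not walk the "next" links round by round.
    public static List<String> expand(String threadUrl, String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = TOTAL_PAGES.matcher(html);
        if (m.find()) {
            int total = Integer.parseInt(m.group(1));
            for (int page = 2; page <= total; page++) {
                // "&page=N" is an assumed URL scheme; adjust it to the forum.
                urls.add(threadUrl + "&page=" + page);
            }
        }
        return urls;
    }
}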

-Tejas Patil

On Tue, Oct 9, 2012 at 1:44 PM, Julien Nioche <[email protected]> wrote:

> Depth is a misleading term which should be replaced by round. Why don't you
> write an HTMLParser to extract the total number of pages and generate
> outlinks to all the pages beyond the first one, i.e. the whole range from 2
> to 30? That's assuming that the total number of pages is expressed in a
> consistent way, of course.
>
> HTH
>
> Julien
>
> On 9 October 2012 10:15, Jiang Fung Wong <[email protected]> wrote:
>
> > Hi All,
> >
> > I am setting up Nutch to crawl forum pages and index the posts in the
> > content pages (threads). I am facing a problem: Nutch cannot discover
> > all the content pages, even though I have set a very high depth.
> >
> > This is because a thread typically has many posts that span several
> > pages. Suppose I am at page 1 of 30. It only contains links to page 2,
> > page 3, up to page 10, and the last page.
> >
> > "[1,2,3,4....10] Next Last"
> >
> > I have to go to page 2 to discover page 11, and so on. So to discover
> > all 30 pages, Nutch has to explore pages 1 to 20, which is not possible
> > with a typical depth.
> >
> > What should I do in this case?
> >
> >
> > Regards,
> > Jiang
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
