I have a similar problem. In my case it seems to happen when many of the
pages get a very low ranking and therefore never get fetched. If I kick off
the scan again, it goes one more layer down the rabbit hole. I have thought
about trying to reduce the percentage that is required for those pages to be
fetched. I honestly have not solved it yet, but I thought I'd mention that
I'm seeing similar tendencies.

On Sun, Apr 12, 2015 at 8:00 PM, Yousin Kim <[email protected]> wrote:

> Hello, I compiled nutch2.3 with gora0.6, mongodb and tried to crawl
> online-shop.
>
> But, I got only front pages except detail pages of products.
> How can I get product detail pages?
>
> Thank you :)
>
> I want to get urls like :
>
> http://www.vanillashu.co.kr/product/detail.html?product_no=20388&cate_no=42&display_group=2
>
> my seed list is http://www.vanillashu.co.kr/
>
> regex-urlfilter
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> #+.
>
> +^(http|https)://.* vanillashu.co.kr/
>
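
One thing that may be worth checking in the filter quoted above: the last
rule, as it appears in this message, has a space between ".*" and
"vanillashu.co.kr/". That may only be an artifact of how the mail was
formatted, but if the space is really in the file, no URL could match that
rule. Below is a minimal Python sketch that imitates the filter, assuming
rules are tried top to bottom, the first rule whose regex is found in the URL
decides, and a URL that matches no rule is rejected. The helper names
load_rules and accepts are made up for this illustration, and Python's re
module stands in for the Java regex engine that Nutch uses.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First character is the sign, the rest is the regular expression.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Active (uncommented) rules copied from the quoted regex-urlfilter.
QUOTED = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
+^(http|https)://.* vanillashu.co.kr/
"""

# Same rules with the space in the last rule removed (a guess at the intent).
NO_SPACE = QUOTED.replace(".* vanillashu", ".*vanillashu")

SEED = "http://www.vanillashu.co.kr/"
DETAIL = ("http://www.vanillashu.co.kr/product/detail.html"
          "?product_no=20388&cate_no=42&display_group=2")

for name, text in (("as quoted", QUOTED), ("space removed", NO_SPACE)):
    rules = load_rules(text)
    print(name, "seed:", accepts(rules, SEED), "detail:", accepts(rules, DETAIL))
```

Under those assumptions, the rules as quoted reject both the seed and the
product detail URL, while the variant without the space accepts both. Since
the front pages were fetched, the space may well not be in the real file, in
which case the filter is probably not what keeps the detail pages out.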
