Re: I want to crawl deep pages

Michael Joyce Sat, 18 Apr 2015 20:32:28 -0700

Do you have any additional information, such as the config you're using
or your crawl stats?


In general my approach to doing deep, single-site crawls has been to ensure
that my config is as liberal as possible (i.e. excludes as few links as
possible) and then use the regex URL filter to keep the crawl from going
outside the relevant domain(s).
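
A minimal sketch of how such a rule set behaves, assuming the usual
regex-urlfilter semantics (rules are tried top to bottom, the first rule
whose pattern is found in the URL decides, and a URL matching no rule is
rejected). The rule set below and the helper names parse_rules and accepts
are hypothetical, written in Python only to check patterns offline; they
are not part of Nutch.

```python
import re

# Hypothetical rule set in regex-urlfilter.txt style:
# a leading '+' accepts, a leading '-' rejects, '#' starts a comment.
RULES = r"""
# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# skip a few suffixes that cannot be parsed
-\.(gif|jpg|png|css|js)$
# accept anything on the target domain, including subdomains
+^https?://([a-z0-9-]+\.)*vanillashu\.co\.kr/
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_pattern)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found anywhere in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.vanillashu.co.kr/product/detail.html"
        "?product_no=20388&cate_no=42&display_group=2",
        "http://www.vanillashu.co.kr/logo.png",
        "http://example.com/",
    ):
        print(accepts(rules, url), url)
```

Running candidate URLs through a check like this before a crawl is a cheap
way to confirm that the detail pages you care about are not being dropped
by an earlier reject rule.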

One relevant property that has bitten me before is:
<name>db.ignore.internal.links</name>


-- Jimmy

On Fri, Apr 17, 2015 at 4:13 PM, steve labar <[email protected]>
wrote:

> I have similar problems. For me it seems to happen when many of the pages
> get a very low ranking and therefore never get fetched. If I kick off the
> scan again it goes one more layer deeper down the rabbit hole. I thought
> about trying to reduce that % which is needed in order to fetch those
> pages. I honestly still have not solved it, but thought I'd mention that
> I'm seeing similar tendencies.
>
> On Sun, Apr 12, 2015 at 8:00 PM, Yousin Kim <[email protected]> wrote:
>
> > Hello, I compiled Nutch 2.3 with Gora 0.6 and MongoDB and tried to
> > crawl an online shop.
> >
> > But I got only the front pages, not the product detail pages.
> > How can I get the product detail pages?
> >
> > Thank you :)
> >
> > I want to get urls like :
> >
> > http://www.vanillashu.co.kr/product/detail.html?product_no=20388&cate_no=42&display_group=2
> >
> > my seed list is http://www.vanillashu.co.kr/
> >
> > regex-urlfilter
> > # skip file: ftp: and mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > # for a more extensive coverage use the urlfilter-suffix plugin
> >
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >
> > # accept anything else
> > #+.
> >
> > +^(http|https)://.* vanillashu.co.kr/
> >
>
