An http.content.limit override? I haven't checked your URL (although I do like your taste in music :) ), but this is a possible source of the problem. HTH, Lewis
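For reference, the limit in question lives in nutch-default.xml (where the property is spelled http.content.limit) and truncates any page larger than the limit before parsing, which silently drops outlinks near the bottom of long index pages. A hedged nutch-site.xml sketch of the override; the value of -1 (no limit) is illustrative, not a recommendation:

```xml
<!-- Raise/remove the download size cap so large index pages are
     fetched and parsed in full, keeping all of their outlinks. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum number of bytes to download per page;
  -1 means no limit.</description>
</property>
```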
On Thursday, August 15, 2013, porcelet <[email protected]> wrote:
> Hello, I'm trying to index all Beatles tabs from www.ultimate-guitar.com,
> but Nutch doesn't want to crawl a lot of the links.
> For example, it reaches this page:
> http://www.ultimate-guitar.com/tabs/beatles_tabs.htm
> but at the next fetch step it doesn't fetch
> http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm or
> http://www.ultimate-guitar.com/tabs/beatles_tabs4.htm or... In fact it
> fetches some tabs from beatles_tabs.htm, but only some in the "album frame".
> Actually, those URLs don't appear in the outlinks of id =
> 'com.ultimate-guitar.www:http/tabs/beatles_tabs.htm'.
> It doesn't even fetch all the links of that first index page.
>
> Here is my regex-urlfilter.txt:
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> -(_btab.*html)$
> -(guitar_pro.htm)$
> -(btab.htm)$
>
> # URLs to visit
>
> #################
> #Ultimate Guitar#
> #################
> +http://www.ultimate-guitar.com/$
> +http://www.ultimate-guitar.com/bands/b.htm
> +http://www.ultimate-guitar.com/bands/b8.htm
> +http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm
> +http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm
>
> In my nutch-site.xml I added this to keep all outlinks:
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>-1 -> process all outlinks</description>
> </property>
>
> Where is the problem?
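[Editor's note: Nutch's urlfilter-regex plugin evaluates the rules top to bottom; the first pattern that matches anywhere in the URL decides (+ accepts, - rejects), and a URL matching no rule is dropped. A quick Python approximation of the posted file (Python `re` stands in for Java regex, which these patterns don't distinguish) confirms the filter itself accepts the missing pages, so the filter rules are unlikely to be the cause:]

```python
import re

# Approximation of Nutch's RegexURLFilter semantics: rules are tried
# in file order, and the first pattern found anywhere in the URL wins.
RULES = [
    ('-', r'^(file|ftp|mailto):'),
    ('-', r'\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$'),
    ('-', r'[?*!@=]'),
    ('-', r'.*(/[^/]+)/[^/]+\1/[^/]+\1/'),
    ('-', r'(_btab.*html)$'),
    ('-', r'(guitar_pro.htm)$'),
    ('-', r'(btab.htm)$'),
    ('+', r'http://www.ultimate-guitar.com/$'),
    ('+', r'http://www.ultimate-guitar.com/bands/b.htm'),
    ('+', r'http://www.ultimate-guitar.com/bands/b8.htm'),
    ('+', r'http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm'),
    ('+', r'http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm'),
]

def accept(url):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no rule matched -> Nutch drops the URL

print(accept('http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm'))  # True
```

Since both beatles_tabs.htm and beatles_tabs3.htm pass the filter, the missing pages must be disappearing earlier, when outlinks are extracted from the truncated parse of the index page.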
> Another question: if I use 2 threads per queue with 3 fetcher threads
> (there is only one website), I receive error 503. Is it avoidable? I fetch
> about 6 pages/s with those parameters; protection against DDoS, I guess?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-doen-t-crawl-all-links-tp4084762.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
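[Editor's note: on the 503 errors, the guess above is right: the server is rate-limiting parallel requests from one crawler. The usual remedy is to stay polite: one fetch thread per host queue and a delay between successive requests to the same host. A hedged nutch-site.xml sketch; the 5-second delay is illustrative, not a value from the thread:]

```xml
<!-- Politeness settings: avoid hitting one host in parallel and
     space out successive requests so the server stops returning 503. -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Number of fetch threads allowed per host queue;
  1 prevents parallel requests to the same server.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests to
  the same server.</description>
</property>
```

This trades crawl speed for reliability: on a single-host crawl the throughput is bounded by roughly one page per fetcher.server.delay seconds.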

