Hello,

I'm trying to index all Beatles tabs from www.ultimate-guitar.com, but Nutch does not crawl many of the links. For example, it reaches the page http://www.ultimate-guitar.com/tabs/beatles_tabs.htm, but at the next fetch step it does not fetch http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm, http://www.ultimate-guitar.com/tabs/beatles_tabs4.htm, and so on. In fact, it fetches some of the tabs linked from beatles_tabs.htm, but only some of those in the "album frame". The missing links do not appear in the outlinks of id = 'com.ultimate-guitar.www:http/tabs/beatles_tabs.htm', so Nutch doesn't even fetch all the links of that first index page.
Here is my regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

-(_btab.*html)$
-(guitar_pro.htm)$
-(btab.htm)$

#URLS to visit
#################
#Ultimate Guitar#
#################
+http://www.ultimate-guitar.com/$
+http://www.ultimate-guitar.com/bands/b.htm
+http://www.ultimate-guitar.com/bands/b8.htm
+http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm
+http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm

In my nutch-site.xml I added this property to get all outlinks:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>-1 -> process all outlinks</description>
</property>

Where is the problem?

Another question: if I use 2 threads per queue with 3 fetcher threads (there is only one website), I receive 503 errors. Is that avoidable? I fetch about 6 pages/s with those parameters. Is it protection against DDoS, I guess?

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-doen-t-crawl-all-links-tp4084762.html
Sent from the Nutch - User mailing list archive at Nabble.com.
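
P.S. To check whether the URL filter itself could be rejecting the missing index pages, the rules from the regex-urlfilter.txt above can be replayed outside Nutch. The following is only a rough Python sketch, not Nutch code. It assumes that rules are tried top to bottom, that the first matching rule decides, that a pattern may match anywhere in the URL, and that a URL matching no rule is rejected. The function names are made up, and the yesterday_* and search.php URLs are invented test cases, not pages known to exist.

```python
import re

# Rules copied from the regex-urlfilter.txt above, in file order.
RULES_TEXT = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
-(_btab.*html)$
-(guitar_pro.htm)$
-(btab.htm)$
+http://www.ultimate-guitar.com/$
+http://www.ultimate-guitar.com/bands/b.htm
+http://www.ultimate-guitar.com/bands/b8.htm
+http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm
+http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    test_urls = [
        "http://www.ultimate-guitar.com/tabs/beatles_tabs.htm",
        "http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm",
        "http://www.ultimate-guitar.com/tabs/beatles_tabs4.htm",
        "http://tabs.ultimate-guitar.com/b/beatles/yesterday_crd.htm",   # invented
        "http://tabs.ultimate-guitar.com/b/beatles/yesterday_btab.htm",  # invented
        "http://www.ultimate-guitar.com/search.php",                     # invented
    ]
    for url in test_urls:
        print("ACCEPT" if accepts(url, RULES) else "REJECT", url)
```

If the beatles_tabs3.htm and beatles_tabs4.htm URLs come out as accepted under these assumptions, the filter is probably not what drops them, which would fit with them being absent from the outlinks in the first place.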

