An http.content.limit override? I haven't checked your URL (although I do like your taste in music :) ), but this is a possible source of the problem. HTH, Lewis
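For reference, the limit in question lives in nutch-default.xml (where the property is spelled http.content.limit) and truncates any page larger than the limit before parsing, which silently drops outlinks near the bottom of long index pages. A hedged nutch-site.xml sketch of the override; the value of -1 (no limit) is illustrative, not a recommendation:

```xml
<!-- Raise/remove the download size cap so large index pages are
     fetched and parsed in full, keeping all of their outlinks. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum number of bytes to download per page;
  -1 means no limit.</description>
</property>
```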
On Thursday, August 15, 2013, porcelet <[email protected]> wrote:
> Hello, I'm trying to index all Beatles tabs from www.ultimate-guitar.com,
> but Nutch doesn't want to crawl a lot of the links.
> For example, it reaches this page:
> http://www.ultimate-guitar.com/tabs/beatles_tabs.htm
> but at the next fetch step it doesn't fetch
> http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm or
> http://www.ultimate-guitar.com/tabs/beatles_tabs4.htm or... In fact it
> fetches some tabs from beatles_tabs.htm, but only some in the "album frame".
> Actually, those URLs don't appear in the outlinks of id =
> 'com.ultimate-guitar.www:http/tabs/beatles_tabs.htm'.
> It doesn't even fetch all the links of that first index page.
>
> Here is my regex-urlfilter.txt:
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> -(_btab.*html)$
> -(guitar_pro.htm)$
> -(btab.htm)$
>
> # URLs to visit
>
> #################
> #Ultimate Guitar#
> #################
> +http://www.ultimate-guitar.com/$
> +http://www.ultimate-guitar.com/bands/b.htm
> +http://www.ultimate-guitar.com/bands/b8.htm
> +http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm
> +http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm
>
> In my nutch-site.xml I added this to keep all outlinks:
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>-1 -> process all outlinks</description>
> </property>
>
> Where is the problem?
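[Editor's note: Nutch's urlfilter-regex plugin evaluates the rules top to bottom; the first pattern that matches anywhere in the URL decides (+ accepts, - rejects), and a URL matching no rule is dropped. A quick Python approximation of the posted file (Python `re` stands in for Java regex, which these patterns don't distinguish) confirms the filter itself accepts the missing pages, so the filter rules are unlikely to be the cause:]

```python
import re

# Approximation of Nutch's RegexURLFilter semantics: rules are tried
# in file order, and the first pattern found anywhere in the URL wins.
RULES = [
    ('-', r'^(file|ftp|mailto):'),
    ('-', r'\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$'),
    ('-', r'[?*!@=]'),
    ('-', r'.*(/[^/]+)/[^/]+\1/[^/]+\1/'),
    ('-', r'(_btab.*html)$'),
    ('-', r'(guitar_pro.htm)$'),
    ('-', r'(btab.htm)$'),
    ('+', r'http://www.ultimate-guitar.com/$'),
    ('+', r'http://www.ultimate-guitar.com/bands/b.htm'),
    ('+', r'http://www.ultimate-guitar.com/bands/b8.htm'),
    ('+', r'http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm'),
    ('+', r'http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm'),
]

def accept(url):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no rule matched -> Nutch drops the URL

print(accept('http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm'))  # True
```

Since both beatles_tabs.htm and beatles_tabs3.htm pass the filter, the missing pages must be disappearing earlier, when outlinks are extracted from the truncated parse of the index page.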
> Another question: if I use 2 threads per queue with 3 fetcher threads
> (there is only one website), I receive error 503. Is it avoidable? I fetch
> about 6 pages/s with those parameters; protection against DDoS, I guess?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-doen-t-crawl-all-links-tp4084762.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
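[Editor's note: on the 503 errors, the guess above is right: the server is rate-limiting parallel requests from one crawler. The usual remedy is to stay polite: one fetch thread per host queue and a delay between successive requests to the same host. A hedged nutch-site.xml sketch; the 5-second delay is illustrative, not a value from the thread:]

```xml
<!-- Politeness settings: avoid hitting one host in parallel and
     space out successive requests so the server stops returning 503. -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Number of fetch threads allowed per host queue;
  1 prevents parallel requests to the same server.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests to
  the same server.</description>
</property>
```

This trades crawl speed for reliability: on a single-host crawl the throughput is bounded by roughly one page per fetcher.server.delay seconds.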

