Hello, I'm trying to index all Beatles tabs from www.ultimate-guitar.com, but
Nutch doesn't crawl a lot of the links.
For example, it reaches this page:
http://www.ultimate-guitar.com/tabs/beatles_tabs.htm
but at the next fetch step it doesn't fetch
http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm or
http://www.ultimate-guitar.com/tabs/beatles_tabs4.htm, and so on. In fact, it
fetches some of the tabs linked from beatles_tabs.htm, but only some of those
in the "album frame".
Those links don't appear in the outlinks of id =
'com.ultimate-guitar.www:http/tabs/beatles_tabs.htm'
so it doesn't even fetch all the links of that first index page.
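
To look up the outlinks of the other index pages, it helps to be able to build their ids. Judging only from the id quoted above, the id looks like the URL with the host name reversed, followed by the scheme and the path. The Python sketch below assumes exactly that format (the handling of ports and query strings is a guess), and `reverse_url` is a hypothetical helper name, not a Nutch API.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row id of the form seen above: reversed host, ':', scheme, path.

    Assumption: the format is inferred from the single id quoted in this
    post; port and query handling are guesses.
    """
    parts = urlsplit(url)
    # "www.ultimate-guitar.com" -> "com.ultimate-guitar.www"
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = reversed_host + ":" + parts.scheme
    if parts.port is not None:
        key += ":" + str(parts.port)
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


if __name__ == "__main__":
    print(reverse_url("http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm"))
```

Applied to the first index page, this reproduces the id quoted above.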

Here is my regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
-(_btab.*html)$
-(guitar_pro.htm)$
-(btab.htm)$

#URLS to visit

#################
#Ultimate Guitar#
#################
+http://www.ultimate-guitar.com/$
+http://www.ultimate-guitar.com/bands/b.htm
+http://www.ultimate-guitar.com/bands/b8.htm
+http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm
+http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm
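
As a sanity check on the filter itself, the rules above can be replayed outside Nutch. The Python sketch below is not Nutch code: it assumes that rules are tried in file order, that a rule applies when its pattern is found anywhere in the URL, that the first applicable rule decides, and that a URL matching no rule is rejected. It also assumes Python's `re` treats these particular patterns the same way Java's regex engine does. `accepts` is a hypothetical helper name.

```python
import re

# Rules copied in order from the regex-urlfilter.txt above: (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("-", r"(_btab.*html)$"),
    ("-", r"(guitar_pro.htm)$"),
    ("-", r"(btab.htm)$"),
    ("+", r"http://www.ultimate-guitar.com/$"),
    ("+", r"http://www.ultimate-guitar.com/bands/b.htm"),
    ("+", r"http://www.ultimate-guitar.com/bands/b8.htm"),
    ("+", r"http://www.ultimate-guitar.com/tabs/beatles_tabs[0-9]*.htm"),
    ("+", r"http://tabs.ultimate-guitar.com/b/beatles/.*_(crd|tab).htm"),
]

COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule found in the URL is a '+' rule.

    Assumption: first match wins, and no match at all means rejection.
    """
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in (
        "http://www.ultimate-guitar.com/tabs/beatles_tabs.htm",
        "http://www.ultimate-guitar.com/tabs/beatles_tabs3.htm",
        "http://www.ultimate-guitar.com/tabs/beatles_tabs4.htm",
    ):
        print(url, "accepted" if accepts(url) else "rejected")
```

Under these assumptions the three index pages named earlier all come out as accepted, which would fit with the links being absent from the outlinks rather than being dropped by the filter.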


In my nutch-site.xml I added the following so that all outlinks are processed:
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>-1 -> process all outlinks
</description>
</property>


Where is the problem?

Another question: if I use 2 threads per queue with 3 fetcher threads (there is
only one website), I receive 503 errors. Is that avoidable? I fetch about
6 pages/s with those parameters, so I guess it is protection against DDoS?


