Try loosening the restrictions on the content limits:

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>ftp.content.limit</name>
    <value>-1</value>
  </property>

Maybe this will help. A couple of quick checks you can run afterwards are
sketched below the quoted message.

2017-07-12 15:57 GMT+02:00 Srinivasa, Rashmi <[email protected]>:

> Hello,
>
> I've been trying to get Nutch to crawl all of my site (let's call it
> my_domain_name.com) for a while now, but it's not working. These are my
> settings:
>
> ---
> nutch-site.xml:
> db.ignore.external.links = true
> db.ignore.external.links.mode = byDomain
> db.max.outlinks.per.page = -1
> http, file and ftp content fetch limits = -1
> http.redirect.max = 2
>
> ---
> regex-urlfilter.txt:
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # Accept everything else
> +.
>
> ---
> Command:
> bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ -1
>
> ---
> When I do a readdb, I find 29,000 pages in the db_unfetched state. I tried
> several crawls, but the number of unfetched documents just seems to
> increase.
> There is no pattern as to which documents stay unfetched. Some documents
> of the exact same type and in the same portion of the sitemap get fetched
> correctly, but others don't. Some PDFs get fetched correctly, but others
> don't. (And it's not a size-limit problem - I checked.) There's nothing in
> robots.txt that would disallow them from being fetched.
> I took one of the PDF docs that are in the db_unfetched state and ran
> parsechecker on it. It parsed the contents correctly.
> I looked at the crawl dump generated by readdb and couldn't find any
> errors or detailed information on why something wasn't fetched.
>
> I'm at a loss here. How can I make Nutch crawl the entire site and fetch
> all the pages/documents? I'm talking about a site with about 40,000 pages,
> not millions.
>
> Thanks,
> Rashmi
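After raising the limits and re-running the crawl, you can compare the status
counts and pull out the URLs that are still unfetched. A rough sketch, assuming
the crawldb sits under the crawl directory from your command
(my_crawl_name/crawldb) and that the plain readdb dump of your Nutch version
prints each URL on the line directly above its Status field:

  # Status counts (db_fetched, db_unfetched, db_gone, ...) for the whole crawldb
  bin/nutch readdb my_crawl_name/crawldb -stats

  # Dump the crawldb and list the URLs still marked db_unfetched;
  # the second grep keeps only the URL lines from the -B 1 context
  bin/nutch readdb my_crawl_name/crawldb -dump crawldb_dump
  grep -B 1 "db_unfetched" crawldb_dump/part-* | grep "http"

  # In local mode the fetcher's errors usually land in logs/hadoop.log;
  # the URL below is just a placeholder for one of your unfetched documents
  grep "my_domain_name.com/some/unfetched/page.pdf" logs/hadoop.log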

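If loosening the limits doesn't change the picture, it is also worth confirming
that the unfetched URLs are not being rejected by the URL filters. Something
along these lines should work, though the exact invocation depends on the Nutch
version (run bin/nutch without arguments to see what your build offers):

  # Feed a few of the unfetched URLs through all configured URL filters;
  # a leading '+' in the output means accepted, '-' means rejected
  echo "http://my_domain_name.com/some/unfetched/page.pdf" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined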
