Still no luck with fetching all the pages on my site. It never goes beyond Iteration 3. Is the number of iterations configured somewhere?
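If I'm reading the stock bin/crawl script correctly, the number of rounds is its last positional argument (the trailing -1 in the command quoted below), and each round is further capped by the -topN value the script passes to the generator and by fetcher.timelimit.mins, which the script sets itself; I'm not sure my version behaves exactly this way. As a sketch, running a fixed number of rounds would look like this (same Solr URL, seed directory and crawl directory as in my original command; 10 is just an example round count):

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ 10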
Thanks,
Rashmi

-----Original Message-----
From: Srinivasa, Rashmi [mailto:[email protected]]
Sent: Thursday, July 13, 2017 9:46 AM
To: [email protected]
Subject: [EXTERNAL] RE: nutch is not fetching all the pages

Hi Filip,

Thanks for the suggestion! I already have those three values set to -1. I also checked a few of the PDF documents in the db_unfetched state, and many of them are smaller than others that have been fetched successfully, so it doesn't look like a size problem.

Thanks,
Rashmi

-----Original Message-----
From: Filip Stysiak [mailto:[email protected]]
Sent: Thursday, July 13, 2017 9:43 AM
To: [email protected]
Subject: [EXTERNAL] Re: nutch is not fetching all the pages

Try loosening the restrictions on the content limits:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
</property>

Maybe this will help.

2017-07-12 15:57 GMT+02:00 Srinivasa, Rashmi <[email protected]>:
> Hello,
>
> I've been trying to get Nutch to crawl all of my site (let's call it
> my_domain_name.com) for a while now, but it's not working. These are my
> settings:
>
> ---
> nutch-site.xml:
> db.ignore.external.links = true
> db.ignore.external.links.mode = byDomain
> db.max.outlinks.per.page = -1
> http, file and ftp content fetch limits = -1
> http.redirect.max = 2
>
> ---
> regex-urlfilter.txt:
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # Accept everything else
> +.
>
> ---
> Command:
> bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ -1
>
> ---
> When I do a readdb, I find 29,000 pages in the db_unfetched state. I have
> tried several crawls, but the number of unfetched documents only seems
> to increase.
> There is no pattern as to which documents stay unfetched. Some documents
> of the exact same type and in the same portion of the sitemap get fetched
> correctly, but others don't. Some PDFs get fetched correctly, but others
> don't. (And it's not a size-limit problem - I checked.) There's nothing in
> robots.txt that would disallow them from being fetched.
> I took one of the PDF documents that is in the db_unfetched state and ran
> parsechecker on it. It parsed the contents correctly.
> I looked at the crawl dump generated by readdb and couldn't find any
> errors or detailed information about why something wasn't fetched.
>
> I'm at a loss here. How can I make Nutch crawl the entire site and fetch
> all the pages/documents? I'm talking about a site with about 40,000 pages,
> not millions.
>
> Thanks,
> Rashmi
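For reference, these are the readdb invocations I've been using to look at the unfetched entries; the -status filter on -dump may not be available in every 1.x release, and unfetched_dump is just an output directory name I picked:

# overall CrawlDb statistics
bin/nutch readdb my_crawl_name/crawldb -stats

# dump only the unfetched entries into a local directory for inspection
bin/nutch readdb my_crawl_name/crawldb -dump unfetched_dump -status db_unfetched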

