nutch is not fetching all the pages

Srinivasa, Rashmi Wed, 12 Jul 2017 06:58:38 -0700

Hello,

I've been trying to get nutch to crawl all of my site (let's call it 
my_domain_name.com) for a while now, but it's not working. These are my 
settings:


---
nutch-site.xml:
  db.ignore.external.links = true
  db.ignore.external.links.mode = byDomain
  db.max.outlinks.per.page = -1
  http, file and ftp content fetch limits = -1
  http.redirect.max = 2

---
regex-urlfilter.txt:
  # skip file: ftp: and mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

  # Accept everything else
  +.
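
To rule out the URL filter as the cause, the rules above can be replayed outside Nutch. The following is only a rough sketch: it assumes the documented regex-urlfilter behaviour (rules are tried top to bottom, the first match decides, and a URL matching no rule is rejected), and it uses Python's `re` module in place of Java regular expressions, which agree for these simple patterns but not in general. The function name `accepts` and the test URLs are made up for illustration.

```python
import re

# Rules copied from the regex-urlfilter.txt quoted above, as (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        # Partial match anywhere in the URL, like a regex "find".
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


# Hypothetical URLs, for illustration only.
for url in ("http://my_domain_name.com/docs/report.pdf",
            "http://my_domain_name.com/img/logo.PNG",
            "mailto:someone@my_domain_name.com"):
    print(url, "->", "accepted" if accepts(url) else "rejected")
```

If URLs taken from the db_unfetched list are accepted here, these filter rules are probably not what keeps them from being fetched.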

---
Command:
  bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ -1

---
When I do a readdb, I find 29,000 pages in the db_unfetched state. I tried 
several crawls, but the number of unfetched documents just seems to increase.
There is no pattern as to which documents stay unfetched. Some documents of the 
exact same type and in the same portion of the sitemap get fetched correctly, 
but others don't. Some pdfs get fetched correctly, but others don't. (And it's 
not a size limit problem - I checked.) There's nothing in robots.txt that would 
disallow them from being fetched.
I took one of the PDF documents in the db_unfetched state and ran 
parsechecker on it. It parsed the contents correctly.
I looked at the crawl dump generated by readdb and couldn't find any errors or 
detailed information about why something wasn't fetched.
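
One way to get more out of the readdb dump is to tally the records by status, so that changes between crawl rounds are easier to see. The sketch below assumes the plain-text dump contains one line per record of the form `Status: 1 (db_unfetched)`; that layout is an assumption and should be checked against the actual dump. The function name `count_statuses` and the sample records are invented for illustration and are not real crawl data.

```python
import re
from collections import Counter

# Assumed record line in the dump text: "Status: <number> (<status_name>)".
STATUS_LINE = re.compile(r"^Status:\s+\d+\s+\((\w+)\)", re.MULTILINE)


def count_statuses(dump_text):
    """Count how many records in the dump text carry each status name."""
    return Counter(STATUS_LINE.findall(dump_text))


# Made-up sample records, only to show the expected input shape.
sample = """http://my_domain_name.com/a.pdf\tVersion: 7
Status: 1 (db_unfetched)
Retries since fetch: 0

http://my_domain_name.com/b.html\tVersion: 7
Status: 2 (db_fetched)
Retries since fetch: 0
"""

for status, count in sorted(count_statuses(sample).items()):
    print(status, count)
```

Running this on the dump after each round would show whether db_unfetched grows only because newly discovered links are added, or because the same URLs are never selected for fetching.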

I'm at a loss here. How can I make nutch crawl the entire site and fetch all 
the pages/documents? I'm talking about a site with about 40,000 pages, not 
millions.

Thanks,
Rashmi

