Try loosening the restrictions on the content limits:

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>ftp.content.limit</name>
    <value>-1</value>
  </property>

Maybe this will help. A couple of quick checks you can run afterwards are
sketched below the quoted message.

2017-07-12 15:57 GMT+02:00 Srinivasa, Rashmi <[email protected]>:

> Hello,
>
> I've been trying to get Nutch to crawl all of my site (let's call it
> my_domain_name.com) for a while now, but it's not working. These are my
> settings:
>
> ---
> nutch-site.xml:
> db.ignore.external.links = true
> db.ignore.external.links.mode = byDomain
> db.max.outlinks.per.page = -1
> http, file and ftp content fetch limits = -1
> http.redirect.max = 2
>
> ---
> regex-urlfilter.txt:
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # Accept everything else
> +.
>
> ---
> Command:
> bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ -1
>
> ---
> When I do a readdb, I find 29,000 pages in the db_unfetched state. I tried
> several crawls, but the number of unfetched documents just seems to
> increase.
> There is no pattern as to which documents stay unfetched. Some documents
> of the exact same type and in the same portion of the sitemap get fetched
> correctly, but others don't. Some PDFs get fetched correctly, but others
> don't. (And it's not a size-limit problem - I checked.) There's nothing in
> robots.txt that would disallow them from being fetched.
> I took one of the PDF docs that are in the db_unfetched state and ran
> parsechecker on it. It parsed the contents correctly.
> I looked at the crawl dump generated by readdb and couldn't find any
> errors or detailed information on why something wasn't fetched.
>
> I'm at a loss here. How can I make Nutch crawl the entire site and fetch
> all the pages/documents? I'm talking about a site with about 40,000 pages,
> not millions.
>
> Thanks,
> Rashmi
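After raising the limits and re-running the crawl, you can compare the status
counts and pull out the URLs that are still unfetched. A rough sketch, assuming
the crawldb sits under the crawl directory from your command
(my_crawl_name/crawldb) and that the plain readdb dump of your Nutch version
prints each URL on the line directly above its Status field:

  # Status counts (db_fetched, db_unfetched, db_gone, ...) for the whole crawldb
  bin/nutch readdb my_crawl_name/crawldb -stats

  # Dump the crawldb and list the URLs still marked db_unfetched;
  # the second grep keeps only the URL lines from the -B 1 context
  bin/nutch readdb my_crawl_name/crawldb -dump crawldb_dump
  grep -B 1 "db_unfetched" crawldb_dump/part-* | grep "http"

  # In local mode the fetcher's errors usually land in logs/hadoop.log;
  # the URL below is just a placeholder for one of your unfetched documents
  grep "my_domain_name.com/some/unfetched/page.pdf" logs/hadoop.log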

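If loosening the limits doesn't change the picture, it is also worth confirming
that the unfetched URLs are not being rejected by the URL filters. Something
along these lines should work, though the exact invocation depends on the Nutch
version (run bin/nutch without arguments to see what your build offers):

  # Feed a few of the unfetched URLs through all configured URL filters;
  # a leading '+' in the output means accepted, '-' means rejected
  echo "http://my_domain_name.com/some/unfetched/page.pdf" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined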
