Still no luck with fetching all the pages on my site. It never goes beyond Iteration 3. Is the number of iterations configured somewhere?
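If I'm reading the stock bin/crawl script correctly, the number of rounds is its last positional argument (the trailing -1 in the command quoted below), and each round is further capped by the -topN value the script passes to the generator and by fetcher.timelimit.mins, which the script sets itself; I'm not sure my version behaves exactly this way. As a sketch, running a fixed number of rounds would look like this (same Solr URL, seed directory and crawl directory as in my original command; 10 is just an example round count):

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ 10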
Thanks,
Rashmi

-----Original Message-----
From: Srinivasa, Rashmi [mailto:[email protected]]
Sent: Thursday, July 13, 2017 9:46 AM
To: [email protected]
Subject: [EXTERNAL] RE: nutch is not fetching all the pages

Hi Filip,

Thanks for the suggestion! I already have those three values set to -1. I also checked a few of the PDF documents in the db_unfetched state, and many of them are smaller than others that have been fetched successfully, so it doesn't look like a size problem.

Thanks,
Rashmi

-----Original Message-----
From: Filip Stysiak [mailto:[email protected]]
Sent: Thursday, July 13, 2017 9:43 AM
To: [email protected]
Subject: [EXTERNAL] Re: nutch is not fetching all the pages

Try loosening the restrictions on the content limits:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
</property>

Maybe this will help.

2017-07-12 15:57 GMT+02:00 Srinivasa, Rashmi <[email protected]>:
> Hello,
>
> I've been trying to get Nutch to crawl all of my site (let's call it
> my_domain_name.com) for a while now, but it's not working. These are my
> settings:
>
> ---
> nutch-site.xml:
> db.ignore.external.links = true
> db.ignore.external.links.mode = byDomain
> db.max.outlinks.per.page = -1
> http, file and ftp content fetch limits = -1
> http.redirect.max = 2
>
> ---
> regex-urlfilter.txt:
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # Accept everything else
> +.
>
> ---
> Command:
> bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ -1
>
> ---
> When I do a readdb, I find 29,000 pages in the db_unfetched state. I have
> tried several crawls, but the number of unfetched documents only seems
> to increase.
> There is no pattern as to which documents stay unfetched. Some documents
> of the exact same type and in the same portion of the sitemap get fetched
> correctly, but others don't. Some PDFs get fetched correctly, but others
> don't. (And it's not a size-limit problem - I checked.) There's nothing in
> robots.txt that would disallow them from being fetched.
> I took one of the PDF documents that is in the db_unfetched state and ran
> parsechecker on it. It parsed the contents correctly.
> I looked at the crawl dump generated by readdb and couldn't find any
> errors or detailed information about why something wasn't fetched.
>
> I'm at a loss here. How can I make Nutch crawl the entire site and fetch
> all the pages/documents? I'm talking about a site with about 40,000 pages,
> not millions.
>
> Thanks,
> Rashmi
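For reference, these are the readdb invocations I've been using to look at the unfetched entries; the -status filter on -dump may not be available in every 1.x release, and unfetched_dump is just an output directory name I picked:

# overall CrawlDb statistics
bin/nutch readdb my_crawl_name/crawldb -stats

# dump only the unfetched entries into a local directory for inspection
bin/nutch readdb my_crawl_name/crawldb -dump unfetched_dump -status db_unfetched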

