Thanks Lewis!

In order to eliminate the regex filter as a potential problem, I removed it
from my nutch-site.xml. To guard against potential size limitations, I
changed the size-limit property in nutch-default.xml from its default value
to -1. However, I don't actually want Nutch to crawl the PDFs themselves; I
just want to capture the links that lead to them.
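
For reference, the change I made looks roughly like this (I'm assuming here
that the relevant property is http.content.limit; the exact property name
may differ in your setup):

```xml
<!-- Sketch of the size-limit override; http.content.limit is an assumption -->
<property>
  <name>http.content.limit</name>
  <!-- -1 disables the content size cap, so large documents are not truncated -->
  <value>-1</value>
</property>
```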

I wasn't sure whether my message to the list yesterday made sense, given
that it continued an ongoing thread from another list, so I posted a new
thread just moments before I received your response. I apologize if this is
bad form, but hopefully it makes clearer what I am trying to achieve.

Thanks to you and all who support others through this list!

-----Original Message-----
From: Lewis John Mcgibbney [mailto:[email protected]] 
Sent: Wednesday, January 22, 2014 11:17 AM
To: [email protected]
Subject: Re: Crawling Websites for Links

Hi Teague,

On Wed, Jan 22, 2014 at 2:18 PM, <[email protected]> wrote:

>
> @Markus: When you say that the problem may be with url filters, what 
> can I do about that?


By default Nutch uses a regex URL filter to filter out URLs which we assume
will ultimately mess up your crawlDB. You need to ensure that your
expressions accommodate the type and nature of the web pages you wish to
crawl. You can see regex-urlfilter.txt within the conf/ directory. There is
also information on the Nutch wiki regarding how URLs can be added for
certain domains, etc.
Finally, there are some values you may wish to inspect and edit within
nutch-site.xml prior to running your crawl. Some large documents (typically
PDFs and the like) can be skipped if parsing takes too long or if the size
of the document is over a certain threshold. I would take some time to
inspect nutch-default.xml and learn about these properties prior to running
your crawler.
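
To illustrate, a minimal regex-urlfilter.txt might look something like the
sketch below (these rules are illustrative, not necessarily the shipped
defaults; rules are tried in order and the first match wins):

```
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# skip common image and archive suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz)$

# accept everything else
+.
```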


> How do I dump the linkdb to inspect it for URLs? I appreciate all the 
> help you've offered thus far!
>

http://wiki.apache.org/nutch/bin/nutch%20readdb
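
In short, the commands look something like this (the crawl/linkdb and
crawl/crawldb paths are examples; substitute your own crawl directory):

```shell
# dump the linkdb to a plain-text directory for inspection
bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump

# similarly, the crawldb can be dumped with readdb
bin/nutch readdb crawl/crawldb -dump crawldb-dump
```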

hth
Lewis
