Thanks Lewis! To eliminate the regex filter as a potential problem, I removed it from my nutch-site.xml. To rule out size limitations, I also changed the size-limit setting in nutch-default.xml from its default value to -1. However, I really don't want Nutch to crawl the PDFs themselves; I just want to capture the links that lead to them.
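For what it's worth, rather than removing the filter entirely, a rule that skips PDF content while leaving other pages crawlable might look like this. This is only a sketch against the stock conf/regex-urlfilter.txt; whether excluded URLs still show up in the linkdb depends on where filters are applied in your crawl cycle (e.g. whether a -filter flag is passed to updatedb/invertlinks):

```
# conf/regex-urlfilter.txt (sketch) -- rules are evaluated top-down,
# first match wins; '-' rejects, '+' accepts.

# skip PDFs so their content is never fetched or parsed
-(?i)\.pdf$

# skip URLs containing characters likely to cause loops, as in the stock file
-[?*!@=]

# accept anything else
+.
```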
I wasn't sure if my message to the list yesterday made sense, given that it had been an ongoing thread on another list, so I posted a new thread just moments before I received your response. I apologize if this is bad form, but hopefully it makes clearer what I am trying to achieve. Thanks to you and all who support others through this list!

-----Original Message-----
From: Lewis John Mcgibbney [mailto:[email protected]]
Sent: Wednesday, January 22, 2014 11:17 AM
To: [email protected]
Subject: Re: Crawling Websites for Links

Hi Teague,

On Wed, Jan 22, 2014 at 2:18 PM, <[email protected]> wrote:
>
> @Markus: When you say that the problem may be with url filters, what
> can I do about that?

By default Nutch uses a regex urlfilter for filtering out URLs which we assume will ultimately mess up your crawlDB. You need to ensure that your expressions accommodate the type and nature of the WebPages you wish to crawl. You can see regex-urlfilter.txt within the conf/ directory. There is also information on the Nutch wiki regarding how URLs can be added for certain domains, etc.

Finally, there are some values you may wish to inspect and edit within nutch-site.xml prior to running your crawl. Some documents which are large (typically PDFs and the like) can be skipped if parsing takes too long or if the size of the document is over a certain threshold. I would take some time to inspect nutch-default.xml and learn about these properties prior to running your crawler.

> How do I dump the linkdb to inspect it for URLs? I appreciate all the
> help you've offered thus far!

http://wiki.apache.org/nutch/bin/nutch%20readdb

hth
Lewis
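As an aside for anyone following along: overrides like the ones Lewis describes belong in conf/nutch-site.xml, whose values take precedence over nutch-default.xml. A sketch, assuming the Nutch 1.x property names http.content.limit and parser.timeout (check your nutch-default.xml for the exact names and defaults in your version):

```xml
<!-- conf/nutch-site.xml (sketch): values here override nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Maximum bytes to download per document; -1 disables the limit.</description>
  </property>
  <property>
    <name>parser.timeout</name>
    <value>-1</value>
    <description>Seconds before a parse attempt is abandoned; -1 disables the timeout.</description>
  </property>
</configuration>
```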

