I have a small setup to index some files on a local box: Solr 5, Nutch 1.11.
I thought I had it configured not to try any URLs that are not local to the system, but it still seems to fetch them:

fetching http://www.cpsc.gov/Media/Documents/Regulations-Laws--Standards/Advisory-Opinions/Wheelchairs-145--/ (queue crawl delay=2000ms)
fetching http://www.cpsc.gov/PageFiles/121846/fuclearance.pdf (queue crawl delay=2000ms)
fetching http://www.cpsc.gov/Business--Manufacturing/Business-Education/Business-Guidance/Phthalates-Information/ (queue crawl delay=2000ms)
-activeThreads=150, spinWaiting=148, fetchQueues.totalSize=2091, fetchQueues.getQueueCount=1
fetching http://www.cpsc.gov/es/Research--Statistics/ (queue crawl delay=2000ms)

The regex-urlfilter.txt:

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# skip specific PDF files in the volumes directory
-.*00(FRONT|INTRO)\.PDF.*

#skip
#-^(http|https)://www\.*$
#-^(http|https)://blogs\.*$
#-^(http|https)://store\.*$
#-^(http|https)://.*\.google.com/.*$
#-^(http|https)://nist.gov/.*$

# accept anything else
#+.

+^http://127.0.0.1:8080/cocoon
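As I read the comments in that file, matching is first-match-wins and a URL that matches nothing is dropped, so with the catch-all "+." commented out the cpsc.gov URLs should already be rejected on paper. To double-check what the filter chain actually decides, I believe the stock URLFilterChecker class that ships with Nutch 1.x can be run through bin/nutch like this (the URLs are just the ones from my log):

echo "http://www.cpsc.gov/es/Research--Statistics/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
echo "http://127.0.0.1:8080/cocoon" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

If I understand the tool right, a leading '+' in its output means the URL was accepted and a leading '-' means it was rejected, so the first command should print a '-' line and the second a '+' line.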
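One more thing I have not been able to rule out: as far as I can tell, db.ignore.external.links (set below) only applies to outlinks found at parse time, and if the crawl is driven by the bin/crawl script I believe the generate step runs with -noFilter, so cpsc.gov URLs that got into the crawldb on earlier runs could keep being generated and fetched regardless of the filters. To check for leftovers (crawl/crawldb is just a guess at my layout; adjust the path):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -l "cpsc.gov" crawldb-dump/part-*

If those URLs show up in the dump, I would presumably need to start from a fresh crawldb, injecting only http://127.0.0.1:8080/cocoon, on top of any config fixes.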
I have searched and tried several things, including these settings in nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks leading from a page to external hosts
    or domain will be ignored. This is an effective way to limit the crawl
    to include only initially injected hosts, without creating complex
    URLFilters. See 'db.ignore.external.links.mode'.
    </description>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>0</value>
    <description>The maximum number of outlinks that we'll process for a
    page. If this value is nonnegative (>=0), at most
    db.max.outlinks.per.page outlinks will be processed for a page;
    otherwise, all outlinks will be processed.
    </description>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>3</value>
    <description>
    If the Crawl-Delay in robots.txt is set to greater than this value (in
    seconds) then the fetcher will skip this page, generating an error
    report. If set to -1 the fetcher will never skip such pages and will
    wait the amount of time retrieved from robots.txt Crawl-Delay, however
    long that might be.
    </description>
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
    <description>Determines how to put URLs into queues. Default value is
    'byHost', also takes 'byDomain' or 'byIP'.
    </description>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>false</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
</configuration>

I inherited this setup and am not that well versed in Nutch. After many hours of searching and trying what I found, I still have no luck: I can't get it to crawl just the local system, http://127.0.0.1:8080/cocoon. Any help would be greatly appreciated.

--
Mitch Baker <[email protected]>
LSA

