RE: Nutching IRS: Solved problem with URL file
The urls file needed http://www.irs.gov/index.html; without the index.html it did not work. I fixed my server problem too! Now IRS has been nutched (to some extent), and the results can be seen here.

My objectives for using nutch are to:
1. Hopefully learn something
2. Create an index of tax-related web pages that are relevant to tax software developers

I noticed something in the log file about not being able to parse PDF (is that true?). I can't wait to nutch some more.

-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Saturday, February 25, 2006 11:50 PM
To: 'nutch-agent@lucene.apache.org'
Subject: Nutching IRS

Hi,

I am trying to set up nutch on Tomcat 5.5/Windows2000/jdk1.5.0_04/latest CYGWIN. I think I am about 99% of the way there, but I finally hit a stumbling block. I followed the instructions to a T, set up the war in the root context, modified the config files, etc., set env NUTCH_JAVA_HOME, etc.

I have 2 problems:
1. The crawl doesn't seem to be working. The crawled dir gets created, but see the log below: 0 records processed.
My second problem is with the servlet (see 2. below).

Thanks in advance for the help.

crawl-urlfilter.txt:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
[EMAIL PROTECTED]
+^http://([a-z0-9]*\.)* irs.gov/
-.
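On the PDF question: the startup log in this thread contains the line "not including: T:\nutch-0.7.1\plugins\parse-pdf", which suggests the PDF parser plugin was simply not loaded for this run. In Nutch the set of loaded plugins is normally governed by the plugin.includes property in the configuration; its value is not shown in the post, so that part is an assumption. Below is a minimal Python sketch (not part of Nutch; the function name plugin_status is made up for illustration) that sorts such log lines into loaded and skipped plugins. The sample lines are a subset copied from the log in this thread.

```python
import re

# A few lines copied from the startup log in this thread (subset only).
LOG = r"""
060225 233932 not including: T:\nutch-0.7.1\plugins\clustering-carrot2
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-html\plugin.xml
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-pdf
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-text\plugin.xml
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-ftp
"""


def plugin_status(log_text):
    """Return (loaded, skipped) plugin directory names found in the log text.

    Hypothetical helper: it only pattern-matches the two line shapes seen
    in this thread's log and knows nothing else about Nutch.
    """
    loaded, skipped = [], []
    for line in log_text.splitlines():
        line = line.strip()
        # "not including: <plugins dir>\<name>" marks a plugin that was skipped.
        m = re.search(r"not including: .*\\([^\\]+)$", line)
        if m:
            skipped.append(m.group(1))
            continue
        # "parsing: <plugins dir>\<name>\plugin.xml" marks a plugin that was read.
        m = re.search(r"parsing: .*\\([^\\]+)\\plugin\.xml$", line)
        if m:
            loaded.append(m.group(1))
    return loaded, skipped


loaded, skipped = plugin_status(LOG)
print("loaded: ", loaded)
print("skipped:", skipped)
print("parse-pdf skipped:", "parse-pdf" in skipped)
```

Running the same function over the full log would give the complete list; any plugin that shows up under "skipped" (parse-pdf among them) did not take part in the crawl.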
Nutching IRS
Hi,

I am trying to set up nutch on Tomcat 5.5/Windows2000/jdk1.5.0_04/latest CYGWIN. I think I am about 99% of the way there, but I finally hit a stumbling block. I followed the instructions to a T, set up the war in the root context, modified the config files, etc., set env NUTCH_JAVA_HOME, etc.

I have 2 problems:
1. The crawl doesn't seem to be working. The crawled dir gets created, but see the log below: 0 records processed.
My second problem is with the servlet (see 2. below).

Thanks in advance for the help.

crawl-urlfilter.txt:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
[EMAIL PROTECTED]
+^http://([a-z0-9]*\.)* irs.gov/
-.

urls:
http://www.irs.gov/

Log:
run java in C:\Program Files\Java\jdk1.5.0_04
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-default.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/crawl-tool.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-site.xml
060225 233931 No FS indicated, using default:local
060225 233931 crawl started in: crawled
060225 233931 rootUrlFile = urls
060225 233931 threads = 10
060225 233931 depth = 3
060225 233932 Created webdb at LocalFS,T:\nutch-0.7.1\crawled\db
060225 233932 Starting URL processing
060225 233932 Plugins: looking in: T:\nutch-0.7.1\plugins
060225 233932 not including: T:\nutch-0.7.1\plugins\clustering-carrot2
060225 233932 not including: T:\nutch-0.7.1\plugins\creativecommons
060225 233932 parsing: T:\nutch-0.7.1\plugins\index-basic\plugin.xml
060225 233932 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060225 233932 not including: T:\nutch-0.7.1\plugins\index-more
060225 233932 not including: T:\nutch-0.7.1\plugins\language-identifier
060225 233932 parsing: T:\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
060225 233932 not including: T:\nutch-0.7.1\plugins\ontology
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-ext
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-html\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-js
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-msword
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-pdf
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-rss
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-text\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-file
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-ftp
060225 233932 parsing: T:\nutch-0.7.1\plugins\protocol-http\plugin.xml
060225 233932 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-httpclient
060225 233932 parsing: T:\nutch-0.7.1\plugins\query-basic\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\query-more
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-site\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-url\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\urlfilter-prefix
060225 233933 parsing: T:\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
060225 233933 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060225 233933 found resource crawl-urlfilter.txt at file:/T:/nutch-0.7.1/conf/crawl-urlfilter.txt
.060225 233933 Added 0 pages
060225 233933 FetchListTool started
060225 233933 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233933 Overall processing: Sorted NaN entries/second
060225 233933 FetchListTool completed
060225 233933 logging at INFO
060225 233934 Updating T:\nutch-0.7.1\crawled\db
060225 233934 Updating for T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233934 Finishing update
060225 233934 Update finished
060225 233934 FetchListTool started
060225 233935 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233935 Overall processing: Sorted NaN entries/second
060225 233935 FetchListTool completed
060225 233935 logging at INFO
060225 233936 Updating T:\nutch-0.7.1\crawled\db
060225 233936 Updating for T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233936 Finishing update
060225 233936 Update finished
060225 233936 FetchListTool started
060225 233936 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233936 Overall processing: Sorted NaN entries/second
060225 233936 FetchListTool completed
060225 233936 log
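
For anyone checking a crawl-urlfilter.txt like the one above by hand: the regex URL filter reads the rules top to bottom, the first pattern found in the URL decides ("+" accepts, "-" rejects), and a URL that matches nothing is rejected. Below is a rough Python sketch of that rule order, not the Nutch implementation. It rests on two assumptions: the line the archive redacted is left out, and the accept rule is written without the space before irs.gov (the post shows one, which may or may not be in the actual file). The function name accepts and the example URLs other than the two irs.gov entries from this thread are made up. The sketch models only the filter file, so it says nothing about why the bare http://www.irs.gov/ entry was not injected.

```python
import re

# Rules from the post, in file order.
# Assumptions: the redacted line is omitted, and the accept rule has no
# space before "irs.gov" (the post shows one; its origin is unclear).
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$",
    r"+^http://([a-z0-9]*\.)*irs.gov/",
    r"-.",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url starts with '+'.

    Hypothetical helper: first match wins, and no match at all means reject.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# The first two URLs come from this thread; the rest are illustrative only.
for url in [
    "http://www.irs.gov/",
    "http://www.irs.gov/index.html",
    "http://www.irs.gov/logo.gif",
    "ftp://ftp.irs.gov/pub/",
    "http://www.example.com/",
]:
    print(url, "->", "accept" if accepts(url) else "reject")
```

Because order matters, moving the final "-." rule above the "+" rule would reject everything, which is a quick thing to rule out when a crawl reports "Added 0 pages".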