RE: Nutching IRS: Solved problem with URL file

2006-02-26 Thread Richard Braman
The urls file needed to contain http://www.irs.gov/index.html
(without the index.html it did not work).
 
I fixed my server problem too! Now IRS has been nutched (to some
extent), and the results can be seen here.
My objective for using nutch is to:
1. Hopefully learn something.
2. Create an index of tax-related web pages that are relevant to tax
software developers.
 
I noticed something in the log file about not being able to parse PDF
(is that true?).
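
On the PDF question: the log quoted in this thread lists parse-pdf under "not including", which suggests the plugin was simply not enabled for this run. A rough sketch of that selection step follows, assuming plugin ids are matched in full against an include pattern (the plugin.includes setting). The pattern value is an assumption reconstructed from the plugins the log reports as parsed, and the widened value is a hypothetical edit, not a tested configuration.

```python
import re

# Assumed include pattern, reconstructed from the plugins the log reports as
# parsed. The real value lives in the Nutch configuration files.
ASSUMED_INCLUDES = (
    r"nutch-extensionpoints|protocol-http|urlfilter-regex|"
    r"parse-(text|html)|index-basic|query-(basic|site|url)"
)


def is_included(plugin_id, includes=ASSUMED_INCLUDES):
    """Return True if the whole plugin id matches the include pattern."""
    return re.fullmatch(includes, plugin_id) is not None


# A few plugin ids taken from the log output.
for plugin_id in ["parse-html", "parse-text", "parse-pdf", "protocol-http"]:
    print(plugin_id, is_included(plugin_id))

# Hypothetical edit: widening the parse group would also admit parse-pdf.
WIDENED = ASSUMED_INCLUDES.replace("parse-(text|html)", "parse-(text|html|pdf)")
print("parse-pdf", is_included("parse-pdf", WIDENED))
```

If the assumption holds, the "not including" lines in the log are just the plugin directories whose ids fail this match.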
 
I can't wait to nutch some more.
 
 

-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 25, 2006 11:50 PM
To: 'nutch-agent@lucene.apache.org'
Subject: Nutching IRS



Nutching IRS

2006-02-25 Thread Richard Braman
Hi,
 
 I am trying to set up nutch on Tomcat 5.5/Windows2000/jdk1.5.0_04/latest
CYGWIN. I think I am about 99% of the way there, but I finally hit a
stumbling block.  I followed the instructions to a T: set up the war in
the root context, modified the config files, set the env variable
NUTCH_JAVA_HOME, etc.  I have 2 problems:
 
1. The crawl doesn't seem to be working.  The crawled dir gets created,
but see the log below: 0 records processed.
My second problem is with the servlet (see 2. below).  Thanks in
advance for the help.
crawl-urlfilter.txt
 
 
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
[EMAIL PROTECTED]
+^http://([a-z0-9]*\.)* irs.gov/
-.
 
urls
http://www.irs.gov/
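
For anyone checking a filter like the one above, here is a minimal sketch of first-match-wins evaluation. It assumes the rules are tried in order, the first pattern found anywhere in the URL decides, and a URL matching no rule is rejected. The names RULES, accepts and without_space are made up for illustration. The line the archive shows as [EMAIL PROTECTED] is left out rather than guessed. The space before irs.gov is copied exactly as posted; it may be a mail artifact rather than what is in the real file, but a literal space in the pattern would keep the rule from matching any URL.

```python
import re

# Rules copied from the crawl-urlfilter.txt above, in order. The obscured
# "[EMAIL PROTECTED]" line is omitted rather than guessed.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz"
          r"|rpm|tgz|mov|MOV|exe|png|PNG)$"),
    ("+", r"^http://([a-z0-9]*\.)* irs.gov/"),  # space kept exactly as posted
    ("-", r"."),
]


def accepts(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


def without_space(rules):
    """Return the same rules with any literal spaces removed from the patterns."""
    return [(sign, pattern.replace(" ", "")) for sign, pattern in rules]


seed = "http://www.irs.gov/"
print("as posted:    ", accepts(seed, RULES))
print("space removed:", accepts(seed, without_space(RULES)))
```

Running a seed URL through a check like this before starting a crawl is a quick way to tell a filter problem apart from a fetch problem.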
 
Log:
run java in C:\Program Files\Java\jdk1.5.0_04
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-default.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/crawl-tool.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-site.xml
060225 233931 No FS indicated, using default:local
060225 233931 crawl started in: crawled
060225 233931 rootUrlFile = urls
060225 233931 threads = 10
060225 233931 depth = 3
060225 233932 Created webdb at LocalFS,T:\nutch-0.7.1\crawled\db
060225 233932 Starting URL processing
060225 233932 Plugins: looking in: T:\nutch-0.7.1\plugins
060225 233932 not including: T:\nutch-0.7.1\plugins\clustering-carrot2
060225 233932 not including: T:\nutch-0.7.1\plugins\creativecommons
060225 233932 parsing: T:\nutch-0.7.1\plugins\index-basic\plugin.xml
060225 233932 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060225 233932 not including: T:\nutch-0.7.1\plugins\index-more
060225 233932 not including: T:\nutch-0.7.1\plugins\language-identifier
060225 233932 parsing: T:\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
060225 233932 not including: T:\nutch-0.7.1\plugins\ontology
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-ext
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-html\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-js
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-msword
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-pdf
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-rss
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-text\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-file
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-ftp
060225 233932 parsing: T:\nutch-0.7.1\plugins\protocol-http\plugin.xml
060225 233932 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-httpclient
060225 233932 parsing: T:\nutch-0.7.1\plugins\query-basic\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\query-more
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-site\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-url\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\urlfilter-prefix
060225 233933 parsing: T:\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
060225 233933 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060225 233933 found resource crawl-urlfilter.txt at file:/T:/nutch-0.7.1/conf/crawl-urlfilter.txt
.060225 233933 Added 0 pages
060225 233933 FetchListTool started
060225 233933 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233933 Overall processing: Sorted NaN entries/second
060225 233933 FetchListTool completed
060225 233933 logging at INFO
060225 233934 Updating T:\nutch-0.7.1\crawled\db
060225 233934 Updating for T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233934 Finishing update
060225 233934 Update finished
060225 233934 FetchListTool started
060225 233935 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233935 Overall processing: Sorted NaN entries/second
060225 233935 FetchListTool completed
060225 233935 logging at INFO
060225 233936 Updating T:\nutch-0.7.1\crawled\db
060225 233936 Updating for T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233936 Finishing update
060225 233936 Update finished
060225 233936 FetchListTool started
060225 233936 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233936 Overall processing: Sorted NaN entries/second
060225 233936 FetchListTool completed
060225 233936 log