Hey Guys,

I'm using Nutch 1.3, and trying to get it to crawl:

http://vault.fbi.gov/

My regex-url filter diff is:

# accept anything else
#+.

+^http://([a-z0-9]*\.)*vault.fbi.gov/
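
As a sanity check on that rule, here is a rough sketch in Python (not Nutch code) of how I understand the filter to behave: rules are tried in file order, the first pattern found in the URL decides, and a URL that matches nothing is dropped. It uses the rule with the character class written as [a-z0-9] and models only that one rule; the other rules in the stock regex-urlfilter.txt are left out. The names RULES and accepts are made up for illustration.

```python
import re

# Only the rule from this mail is modelled; the other rules in the stock
# regex-urlfilter.txt are deliberately omitted (assumption).
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*vault.fbi.gov/")),
]


def accepts(url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # No rule matched: the URL is rejected.
    return False


for url in (
    "http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view",
    "http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file",
    "http://www.fbi.gov/",
):
    print(accepts(url), url)
```

If this sketch matches what the filter really does, the rule on its own does not exclude the at_download/file URLs.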

I'm trying to get it to parse PDFs like:

http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file

With my config, ParserChecker fetches and parses it OK:

[chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch \
    org.apache.nutch.parse.ParserChecker \
    http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
contentType: application/pdf
---------
Url
---------------
http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: 
Outlinks: 0
Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 
Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; 
filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT 
Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML 
Cache-Control=max-age=604800 
Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf 

I didn't change conf/parse-plugins.xml, or plugin.includes in 
conf/nutch-default.xml, as it looks like parse-tika is 
included and handles the * contentType. 

If I merge the segments, dump them, and then grep the dump for URLs, 
I see the crawl getting as far as:

http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view

It reaches that type of URL, but then doesn't grab the PDF once it parses the 
page, or add it to the outlinks, as I never see a:

http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file

in the URL list. 
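
To make that check concrete, here is a small Python sketch of what I'm doing by eye with grep: pull every vault.fbi.gov URL out of the dump text, and for each .../view page report the matching .../at_download/file URL if it never shows up. The sample text is made up for illustration (it is not real dump output), and the names urls_in and missing_downloads are hypothetical.

```python
import re

# Matches any vault.fbi.gov URL up to the next whitespace character.
URL_RE = re.compile(r"http://vault\.fbi\.gov/\S+")


def urls_in(dump_text):
    """Return the distinct vault.fbi.gov URLs found in the text, sorted."""
    return sorted(set(URL_RE.findall(dump_text)))


def missing_downloads(urls):
    """For each .../view URL, return the .../at_download/file URL if it is absent."""
    seen = set(urls)
    missing = []
    for url in sorted(seen):
        if url.endswith("/view"):
            pdf = url[: -len("/view")] + "/at_download/file"
            if pdf not in seen:
                missing.append(pdf)
    return missing


# Made-up sample standing in for the dumped segment text (assumption).
sample = """
http://vault.fbi.gov/watergate
http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
"""

for pdf in missing_downloads(urls_in(sample)):
    print("never seen:", pdf)
```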

I'm running this command to crawl:

./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10

Any idea what I'm doing wrong?

Cheers
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
