Hey Guys,

I'm using Nutch 1.3, and trying to get it to crawl:

http://vault.fbi.gov/

My regex-urlfilter diff is:

# accept anything else
#+.
+^http://([a-z0-9*\.)*vault.fbi.gov/

I'm trying to get it to parse PDFs like:

http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file

With my config, ParserChecker parses it OK:

[chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
contentType: application/pdf
---------
Url
---------------
http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf

I didn't change conf/parse-plugins.xml, or plugin.includes in conf/nutch-default.xml (it looks like parse-tika is included and handles the * contentType).

If I merge the segments, dump them, and grep the crawl output for URLs, I see the crawl getting to URLs like:

http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view

But after parsing that type of URL, it doesn't grab the PDF or add it to the outlinks; I never see a:

http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file

in the URL list.

I'm running this command to crawl:

./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10

Any idea what I'm doing wrong?
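
One way to rule the URL filter in or out is to run the two kinds of URL through the pattern outside of Nutch. The snippet below is only a sketch under a few assumptions: the class name FilterCheck is made up for illustration, the pattern is the bracket-closed form of the stock rule (`[a-z0-9]*`) without the leading `+`, and the filter is assumed to do an unanchored, find-style match. If both URLs pass here, the filter is probably not what is dropping the at_download links.

```java
import java.util.regex.Pattern;

public class FilterCheck {
    // Assumption: bracket-closed form of the accept rule, minus the leading '+'
    // (the '+' only marks the rule as "accept" in regex-urlfilter.txt).
    static final Pattern RULE =
        Pattern.compile("^http://([a-z0-9]*\\.)*vault.fbi.gov/");

    // Assumption: the filter accepts a URL when the pattern is found anywhere in it.
    static boolean accepts(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view",
            "http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file"
        };
        // Print the accept/reject decision next to each candidate URL.
        for (String url : urls) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```
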
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

