What is your http.content.limit set to? Does it allow for a 1.2 MB file (the 
Content-Length above is 1228493 bytes)? Can you also check without merging the 
segments? Or, as a last resort, inject that single URL into an empty crawldb 
and run a single crawl cycle, preferably using the separate commands instead 
of the crawl command?
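
For reference, a minimal sketch of the override I have in mind, assuming you
keep overrides in conf/nutch-site.xml (the default in nutch-default.xml is
65536 bytes, so a 1.2 MB PDF would be truncated before parsing; the value
below is only an example):

  <!-- conf/nutch-site.xml: raise or disable the per-fetch content cap -->
  <property>
    <name>http.content.limit</name>
    <!-- -1 means no limit; any value larger than the PDF (e.g. 2000000) also works -->
    <value>-1</value>
  </property>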

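And a sketch of the single-cycle, separate-commands route; the directory names
(urls/, crawl/, dump_dir) are placeholders, and this assumes a urls/ dir
containing only that one URL and an empty crawldb:

  # seed with the single URL, then run one generate/fetch/parse/update cycle
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1
  SEG=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $SEG
  bin/nutch parse $SEG
  bin/nutch updatedb crawl/crawldb $SEG
  # inspect the parsed segment directly, without merging
  bin/nutch readseg -dump $SEG dump_dir

That way you can look at the ParseData and outlinks of exactly one segment
instead of a merged dump.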

> Hey Guys,
> 
> I'm using Nutch 1.3, and trying to get it to crawl:
> 
> http://vault.fbi.gov/
> 
> My regex-urlfilter diff is:
> 
> # accept anything else
> #+.
> 
> +^http://([a-z0-9]*\.)*vault.fbi.gov/
> 
> I'm trying to get it to parse PDFs like:
> 
> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> 
> I see that, with my config, ParserChecker lets me parse it OK:
> 
> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> contentType: application/pdf
> ---------
> Url
> ---------------
> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
> 
> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms
> of plugin.includes, as it looks like parse-tika is included and handles
> the * contentType.
> 
> If I merge the segs, dump them, and then grep the dump for URLs, I see
> the crawl getting to URLs like:
> 
> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
> 
> That type of URL gets fetched, but the PDF is never grabbed once that
> page is parsed, nor added to the outlinks, as I never see:
> 
> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
> 
> anywhere in the URL list.
> 
> I'm running this command to crawl:
> 
> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
> 
> Any idea what I'm doing wrong?
> 
> Cheers
> Chris
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
