Hey Markus, I set http.content.limit to -1, so it shouldn't have a limit. I'll try injecting that single URL into an empty crawl db and see if I can get it to download using separate commands, and see what happens! :-)
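For the archives, the cycle I have in mind is roughly the following (untested as of this mail, and assuming the crawl/crawldb and crawl/segments layout that the crawl command creates under -dir crawl):

  ./runtime/local/bin/nutch inject crawl/crawldb urls
  ./runtime/local/bin/nutch generate crawl/crawldb crawl/segments -topN 1
  SEGMENT=`ls -d crawl/segments/* | tail -1`   # the segment generate just created
  ./runtime/local/bin/nutch fetch $SEGMENT
  ./runtime/local/bin/nutch parse $SEGMENT
  ./runtime/local/bin/nutch updatedb crawl/crawldb $SEGMENT

That should show whether the at_download/file URL survives fetch and parse on its own, outside the all-in-one crawl command.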
Cheers,
Chris

On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:

> What's your http.content.limit set to? Does it allow for a 1.2MB file? Can
> you also check without merging segments? Or, as a last resort, inject that
> single URL into an empty crawl db and do a single crawl cycle, preferably
> by using separate commands instead of the crawl command?
>
>> Hey Guys,
>>
>> I'm using Nutch 1.3 and trying to get it to crawl:
>>
>> http://vault.fbi.gov/
>>
>> My regex-urlfilter.txt diff is:
>>
>> # accept anything else
>> #+.
>>
>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>
>> I'm trying to get it to parse PDFs like:
>>
>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>
>> I see that my config lets ParserChecker parse it OK:
>>
>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> contentType: application/pdf
>> ---------
>> Url
>> ---------------
>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title:
>> Outlinks: 0
>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493
>>   Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment;
>>   filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
>>   Connection=close Accept-Ranges=bytes Content-Type=application/pdf
>>   Server=HTML Cache-Control=max-age=604800
>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>
>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms
>> of plugin.includes (it looks like parse-tika is included and handles the
>> * contentType).
>>
>> If I merge the segments, dump them, and grep the dump for URLs, I see the
>> crawl getting to URLs like:
>>
>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>
>> but it never grabs the PDF once it parses that page, or adds it to the
>> outlinks; I never see
>>
>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>
>> in the URL list.
>>
>> I'm running this command to crawl:
>>
>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>
>> Any idea what I'm doing wrong?
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
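One more note before I sign off: the merge/dump/grep check I describe in the quoted mail above was done roughly like this (from memory, so treat the exact mergesegs/readseg arguments as approximate):

  ./runtime/local/bin/nutch mergesegs crawl/merged -dir crawl/segments
  MERGED=`ls -d crawl/merged/* | tail -1`      # the merged segment
  ./runtime/local/bin/nutch readseg -dump $MERGED dumpdir
  grep at_download dumpdir/dump

The grep on the dump file is what comes up empty for the at_download URLs.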
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
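P.S. To rule out the regex filter silently dropping the at_download URLs, I'll also push one through the filter chain with the URL filter checker; if I'm remembering the 1.3 class name and flag right (corrections welcome), that's:

  echo "http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file" \
    | ./runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

It reads URLs from stdin and echoes each one back prefixed with "+" if all active filters accept it, or "-" if one rejects it.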

