Hey Markus, I set http.content.limit to -1, so it shouldn't have a limit. I'll try injecting that single URL into an empty crawl db and see if I can get it to download using separate commands, and see what happens! :-)
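For the archives, the cycle I have in mind is roughly the following (untested as of this mail, and assuming the crawl/crawldb and crawl/segments layout that the crawl command creates under -dir crawl):

  ./runtime/local/bin/nutch inject crawl/crawldb urls
  ./runtime/local/bin/nutch generate crawl/crawldb crawl/segments -topN 1
  SEGMENT=`ls -d crawl/segments/* | tail -1`   # the segment generate just created
  ./runtime/local/bin/nutch fetch $SEGMENT
  ./runtime/local/bin/nutch parse $SEGMENT
  ./runtime/local/bin/nutch updatedb crawl/crawldb $SEGMENT

That should show whether the at_download/file URL survives fetch and parse on its own, outside the all-in-one crawl command.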
Cheers,
Chris

On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:

> What's your http.content.limit set to? Does it allow for a 1.2MB file? Can
> you also check without merging segments? Or, as a last resort, inject that
> single URL into an empty crawl db and do a single crawl cycle, preferably
> by using separate commands instead of the crawl command?
>
>> Hey Guys,
>>
>> I'm using Nutch 1.3 and trying to get it to crawl:
>>
>> http://vault.fbi.gov/
>>
>> My regex-urlfilter.txt diff is:
>>
>> # accept anything else
>> #+.
>>
>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>
>> I'm trying to get it to parse PDFs like:
>>
>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>
>> I see that my config lets ParserChecker parse it OK:
>>
>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> contentType: application/pdf
>> ---------
>> Url
>> ---------------
>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title:
>> Outlinks: 0
>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493
>>   Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment;
>>   filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
>>   Connection=close Accept-Ranges=bytes Content-Type=application/pdf
>>   Server=HTML Cache-Control=max-age=604800
>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>
>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms
>> of plugin.includes (it looks like parse-tika is included and handles the
>> * contentType).
>>
>> If I merge the segments, dump them, and grep the dump for URLs, I see the
>> crawl getting to URLs like:
>>
>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>
>> but it never grabs the PDF once it parses that page, or adds it to the
>> outlinks; I never see
>>
>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>
>> in the URL list.
>>
>> I'm running this command to crawl:
>>
>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>
>> Any idea what I'm doing wrong?
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
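One more note before I sign off: the merge/dump/grep check I describe in the quoted mail above was done roughly like this (from memory, so treat the exact mergesegs/readseg arguments as approximate):

  ./runtime/local/bin/nutch mergesegs crawl/merged -dir crawl/segments
  MERGED=`ls -d crawl/merged/* | tail -1`      # the merged segment
  ./runtime/local/bin/nutch readseg -dump $MERGED dumpdir
  grep at_download dumpdir/dump

The grep on the dump file is what comes up empty for the at_download URLs.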
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
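P.S. To rule out the regex filter silently dropping the at_download URLs, I'll also push one through the filter chain with the URL filter checker; if I'm remembering the 1.3 class name and flag right (corrections welcome), that's:

  echo "http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file" \
    | ./runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

It reads URLs from stdin and echoes each one back prefixed with "+" if all active filters accept it, or "-" if one rejects it.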

