What's your http.content.limit set to? Does it allow for a 1.2 MB file? The default of 65536 bytes would truncate that PDF long before the end. Can you also check without merging segments? Or, as a last resort, inject that single URL into an empty crawldb and run a single crawl cycle, preferably using the separate commands instead of the crawl command?
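For reference, this is the property I mean, set in conf/nutch-site.xml (a sketch: -1 lifts the limit entirely, or use any value larger than the 1228493-byte Content-Length your ParserChecker run reported):

    <property>
      <name>http.content.limit</name>
      <!-- default is 65536 bytes; -1 means no truncation of fetched content -->
      <value>-1</value>
    </property>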
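To look at a single segment without merging, something like this should do (readseg is the SegmentReader tool; the dump directory name here is just an example):

    # dump the newest segment and grep it for the PDF download URLs
    s=$(ls -d crawl/segments/* | tail -1)
    ./runtime/local/bin/nutch readseg -dump $s dump_one_seg
    grep at_download dump_one_seg/dump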
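And the single-cycle last resort, roughly (directory names like crawl2 and single are placeholders; the seed file holds only the /view page, so you can see whether the at_download/file outlink ever shows up after the parse):

    # one seed URL in an otherwise empty crawldb
    mkdir single
    echo 'http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view' > single/seed.txt
    ./runtime/local/bin/nutch inject crawl2/crawldb single
    ./runtime/local/bin/nutch generate crawl2/crawldb crawl2/segments -topN 1
    s=$(ls -d crawl2/segments/* | tail -1)
    ./runtime/local/bin/nutch fetch $s
    ./runtime/local/bin/nutch parse $s
    ./runtime/local/bin/nutch updatedb crawl2/crawldb $s
    ./runtime/local/bin/nutch readseg -dump $s dump_single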
> Hey Guys,
>
> I'm using Nutch 1.3, and trying to get it to crawl:
>
> http://vault.fbi.gov/
>
> My regex-urlfilter diff is:
>
> # accept anything else
> #+.
>
> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>
> I'm trying to get it to parse PDFs like:
>
> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>
> I see that my config lets ParserChecker parse it OK:
>
> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch
> org.apache.nutch.parse.ParserChecker
> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> contentType: application/pdf
> ---------
> Url
> ---------------
> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493
> Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment;
> filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
> Connection=close Accept-Ranges=bytes Content-Type=application/pdf
> Server=HTML Cache-Control=max-age=604800
> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>
> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms
> of the plugin.includes (as it looks like parse-tika is included and
> handles the * contentType).
>
> If I merge the segments, dump them, and grep the dump for URLs, I see the
> crawl getting to URLs like:
>
> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>
> but then it never grabs the PDF once it parses that page, or adds it to
> the outlinks, as I never see:
>
> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>
> in the URL list.
>
> I'm running this command to crawl:
>
> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>
> Any idea what I'm doing wrong?
>
> Cheers
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

