Can you also try with trunk or 1.4? I get different output with parsechecker, such as a proper title:
markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
contentType: application/pdf
signature: 818fd03d7f9011b4f7000657e2aaf966
---------
Url
---------------
http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Watergate Summary Part 02 of 02
Outlinks: 0
Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI

> Hey Markus,
>
> I set http.content.limit to -1, so it shouldn't have a limit.
>
> I'll try injecting that single URL using separate commands and see what happens! :-)
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
> > What's your http.content.limit set to? Does it allow for a 1.2MB file?
> > Can you also check without merging segments? Or, as a last resort, inject
> > that single URL into an empty crawl db and do a single crawl cycle,
> > preferably using separate commands instead of the crawl command?
> >
> >> Hey Guys,
> >>
> >> I'm using Nutch 1.3, and trying to get it to crawl:
> >>
> >> http://vault.fbi.gov/
> >>
> >> My regex-url filter diff is:
> >>
> >> # accept anything else
> >> #+.
> >>
> >> +^http://([a-z0-9*\.)*vault.fbi.gov/
> >>
> >> I'm trying to get it to parse PDFs like:
> >>
> >> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >>
> >> I see that ParserChecker with my config parses it OK:
> >>
> >> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> contentType: application/pdf
> >> ---------
> >> Url
> >> ---------------
> >> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> ---------
> >> ParseData
> >> ---------
> >> Version: 5
> >> Status: success(1,0)
> >> Title:
> >> Outlinks: 0
> >> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
> >> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
> >>
> >> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms of
> >> plugin.includes, as it looks like parse-tika is included and handles the *
> >> contentType.
> >>
> >> If I merge the segments, dump them, and grep the dump for URLs, I see the
> >> crawl getting to URLs like:
> >>
> >> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
> >>
> >> but once that page is parsed it never grabs the PDF or adds it to the
> >> outlinks; I never see a
> >>
> >> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
> >>
> >> in the URL list.
> >>
> >> I'm running this command to crawl:
> >>
> >> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
> >>
> >> Any idea what I'm doing wrong?
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: [email protected]
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
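For completeness, a few concrete sketches of the steps discussed above. First, the content limit: it is overridden in conf/nutch-site.xml, inside the <configuration> element, with something like the following (the description text is only a paraphrase, not copied from nutch-default.xml):

  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation of fetched content -->
    <value>-1</value>
    <description>Maximum number of bytes downloaded per page; -1 means no limit.</description>
  </property>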

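Next, a rough single-URL cycle with separate commands, run from runtime/local; the seed directory and the crawl-test/ paths are placeholders, not anything taken from the thread:

  # seed list containing just the one PDF URL
  mkdir -p seed
  echo 'http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file' > seed/urls.txt

  # inject into an empty crawldb, generate one segment, then fetch/parse/update it
  bin/nutch inject crawl-test/crawldb seed
  bin/nutch generate crawl-test/crawldb crawl-test/segments -topN 1
  SEGMENT=`ls -d crawl-test/segments/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl-test/crawldb $SEGMENT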

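Finally, to check a single segment without merging, readseg can dump just the parse output so the outlinks and metadata can be grepped directly; the segment timestamp below is a placeholder for whatever generate created:

  # dump only the parse data/text of one segment into dump_dir/dump
  bin/nutch readseg -dump crawl/segments/20111123150102 dump_dir -nocontent -nofetch -nogenerate
  grep at_download dump_dir/dump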