OK, never mind. This *is* different behavior from 1.3, apparently, but I figured out how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
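For anyone who hits the same thing, the two overrides in question look roughly like this in runtime/local/conf/nutch-default.xml (a sketch; the agent name value below is only a placeholder, not the one I actually used, and any non-empty name satisfies the Fetcher check):

<property>
  <name>http.agent.name</name>
  <!-- placeholder value; set this to your own crawler's name -->
  <value>my-nutch-crawler</value>
</property>

<property>
  <name>http.content.limit</name>
  <!-- -1 disables truncation, so the ~1.2MB PDFs aren't cut off -->
  <value>-1</value>
</property>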
I'll report back on whether I'm able to grab the PDFs using 1.4...

Cheers,
Chris

On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:

> *really* weird.
>
> With 1.4, even though I have my http.agent.name property set in
> conf/nutch-default.xml, it keeps telling me this when I try to crawl:
>
> Fetcher: No agents listed in 'http.agent.name' property.
> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> [chipotle:local/nutch/framework] mattmann%
>
> Is nutch-default.xml not read by the crawl command in 1.4?
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>
>> Can you also try with trunk or 1.4? I get different output with
>> parsechecker, such as a proper title.
>>
>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> contentType: application/pdf
>> signature: 818fd03d7f9011b4f7000657e2aaf966
>> ---------
>> Url
>> ---------------
>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: Watergate Summary Part 02 of 02
>> Outlinks: 0
>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI
>>
>>> Hey Markus,
>>>
>>> I set the http.content.limit to -1, so it shouldn't have a limit.
>>>
>>> I'll try injecting that single URL and see if I can get it to download
>>> using separate commands, and see what happens! :-)
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>
>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>>> Can you also check without merging segments? Or, as a last resort, inject
>>>> that single URL into an empty crawl db and do a single crawl cycle,
>>>> preferably using separate commands instead of the crawl command?
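For the archives, the separate-command cycle Markus suggests above would look roughly like this (a sketch against an empty crawl db; the urls seed directory and the crawl/ paths are just from my local layout):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1
# pick up the segment that generate just created
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

Running the steps separately like this makes it much easier to see which stage (fetch, parse, or db update) is losing the PDF.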
>>>>> Hey Guys,
>>>>>
>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>
>>>>> http://vault.fbi.gov/
>>>>>
>>>>> My regex-urlfilter diff is:
>>>>>
>>>>> # accept anything else
>>>>> #+.
>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>
>>>>> I'm trying to get it to parse PDFs like:
>>>>>
>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>
>>>>> I see that ParserChecker, with my config, parses it OK:
>>>>>
>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> contentType: application/pdf
>>>>> ---------
>>>>> Url
>>>>> ---------------
>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> ---------
>>>>> ParseData
>>>>> ---------
>>>>> Version: 5
>>>>> Status: success(1,0)
>>>>> Title:
>>>>> Outlinks: 0
>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>
>>>>> I didn't change conf/parse-plugins.xml or the plugin.includes in
>>>>> conf/nutch-default.xml, since parse-tika is already included and
>>>>> handles the * contentType.
>>>>>
>>>>> If I merge the segments, dump them, and grep the dump for URLs, I see
>>>>> the crawl reaching pages like:
>>>>>
>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>
>>>>> but it never grabs the PDF when it parses that page, and never adds it
>>>>> to the outlinks: I never see
>>>>>
>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>
>>>>> in the URL list.
>>>>>
>>>>> I'm running this command to crawl:
>>>>>
>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>
>>>>> Any idea what I'm doing wrong?
>>>>>
>>>>> Cheers,
>>>>> Chris
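P.S. The merge-and-dump check described in the quoted message above was roughly this sequence (a sketch; the directory names are just from my layout, and if memory serves, readseg writes its text output to a file named "dump" in the output directory):

bin/nutch mergesegs crawl/merged -dir crawl/segments
# mergesegs creates a single timestamped segment under the output dir
seg=`ls -d crawl/merged/* | head -1`
bin/nutch readseg -dump $seg crawl/dump
grep at_download crawl/dump/dump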
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

