OK, it didn't work again. Here are the URLs from a full crawl cycle: http://pastebin.com/Jx3Ar6Md
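(That list is from the usual merge-and-dump route; roughly, modulo my local
paths:

  bin/nutch mergesegs crawl/merged -dir crawl/segments
  bin/nutch readseg -dump crawl/merged/* dump
  grep 'http://vault.fbi.gov' dump/dump | sort -u

so it should be every vault.fbi.gov URL the cycle saw.)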
When I run it independently and seed it with an *at_download* URL, pointing
directly at the PDF, it parses the PDF. But when I run the normal crawl with
topN 10 and depth 10, it never picks them up. /me stumped. I'll poke around in
the code, but was just wondering if I was doing something wrong.

Cheers,
Chris

On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:

> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
> how to make it work in 1.4: instead of editing the global, top-level
> conf/nutch-default.xml, I needed to edit
> runtime/local/conf/nutch-default.xml. Crawling is forging ahead.
>
> I'll report back on whether I'm able to grab the PDFs or not, using 1.4...
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>
>> *really* weird.
>>
>> With 1.4, even though I have my http.agent.name property set in
>> conf/nutch-default.xml, it keeps telling me this when I try to crawl:
>>
>> Fetcher: No agents listed in 'http.agent.name' property.
>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No
>> agents listed in 'http.agent.name' property.
>>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>> [chipotle:local/nutch/framework] mattmann%
>>
>> Is nutch-default.xml not read by the crawl command in 1.4?
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>
>>> Can you also try with trunk or 1.4? I get different output with
>>> parsechecker, such as a proper title.
>>>
>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch
>>> parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> contentType: application/pdf
>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>> ---------
>>> Url
>>> ---------------
>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> ---------
>>> ParseData
>>> ---------
>>> Version: 5
>>> Status: success(1,0)
>>> Title: Watergate Summary Part 02 of 02
>>> Outlinks: 0
>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493
>>> Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment;
>>> filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
>>> Connection=close Accept-Ranges=bytes Content-Type=application/pdf
>>> Server=HTML Cache-Control=max-age=604800
>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z
>>> created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter
>>> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT
>>> Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf
>>> creator=FBI
>>>
>>>> Hey Markus,
>>>>
>>>> I set http.content.limit to -1, so it shouldn't have a limit.
>>>>
>>>> I'll try injecting that single URL and see if I can get it to download
>>>> using separate commands, and see what happens! :-)
>>>>
>>>> Cheers,
>>>> Chris
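(For the archive, that separate-commands cycle is roughly the following, with
my paths and a topN of 10; the segment dir is just whatever generate last
created:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 10
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s

Seeded with the at_download URL this way, the PDF fetches and parses fine.)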
>>>>
>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>>
>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>>>> Can you also check without merging segments? Or, as a last resort,
>>>>> inject that single URL into an empty crawl db and do a single crawl
>>>>> cycle, preferably using separate commands instead of the crawl command?
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>>
>>>>>> http://vault.fbi.gov/
>>>>>>
>>>>>> My regex-urlfilter.txt diff is:
>>>>>>
>>>>>> # accept anything else
>>>>>> #+.
>>>>>>
>>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>>
>>>>>> I'm trying to get it to parse PDFs like:
>>>>>>
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>
>>>>>> I see that my config lets ParserChecker parse it OK:
>>>>>>
>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch
>>>>>> org.apache.nutch.parse.ParserChecker
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> contentType: application/pdf
>>>>>> ---------
>>>>>> Url
>>>>>> ---------------
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> ---------
>>>>>> ParseData
>>>>>> ---------
>>>>>> Version: 5
>>>>>> Status: success(1,0)
>>>>>> Title:
>>>>>> Outlinks: 0
>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT
>>>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT
>>>>>> Content-Disposition=attachment; filename="watergat2.pdf"
>>>>>> Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close
>>>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
>>>>>> Cache-Control=max-age=604800
>>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>>
>>>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in
>>>>>> terms of plugin.includes, as it looks like parse-tika is included and
>>>>>> handles the * contentType.
>>>>>>
>>>>>> If I merge the segments, dump them, and grep for URLs, I see the crawl
>>>>>> getting to URLs like:
>>>>>>
>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>
>>>>>> but then not grabbing the PDF once it parses that page, or adding it to
>>>>>> the outlinks: I never see a
>>>>>>
>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>
>>>>>> in the URL list.
>>>>>>
>>>>>> I'm running this command to crawl:
>>>>>>
>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>
>>>>>> Any idea what I'm doing wrong?
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
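(Side note, since this tripped me up above: both knobs can live in
conf/nutch-site.xml, which overrides nutch-default.xml; with the 1.4 crawl
command it's the copy under runtime/local/conf that actually gets read.
Roughly, with a placeholder agent name:

  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

where -1 means no truncation, so the 1.2MB PDF comes through whole.)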
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

