Umm...sigh, that didn't solve it. I'll keep looking.
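
For reference, the theory behind the -1 setting: ParserChecker reports 169 outlinks for the /view page (see the quoted message below), so a per-page cap smaller than that could drop the at_download link before it ever reaches the crawl db. A minimal Python sketch of that idea follows. It assumes the cap simply truncates the parsed outlink list and that the default cap is 100; the link list and the position of the at_download link in it are invented for illustration.

```python
# Sketch only: models a per-page outlink cap as plain truncation of the
# parsed outlink list. The default of 100 and the truncation rule are
# assumptions about Nutch, not something confirmed in this thread.

def kept_outlinks(outlinks, max_per_page=100):
    """Return the outlinks that survive the cap; a negative cap keeps all."""
    if max_per_page < 0:
        return list(outlinks)
    return list(outlinks[:max_per_page])


# 169 placeholder links, matching the ParserChecker count quoted below.
# Putting the at_download link at index 150 is a made-up position.
links = [f"http://vault.fbi.gov/link-{i}" for i in range(169)]
links[150] = ("http://vault.fbi.gov/watergate/"
              "watergate-summary-part-01-of-02/at_download/file")

capped = kept_outlinks(links)        # assumed default cap of 100
uncapped = kept_outlinks(links, -1)  # the setting tried in this thread

print(len(capped), links[150] in capped)      # 100 False
print(len(uncapped), links[150] in uncapped)  # 169 True
```

Since -1 did not change the result here, the cap is evidently not the whole story.
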
Cheers,
Chris

On Nov 23, 2011, at 9:11 PM, Mattmann, Chris A (388J) wrote:
> Uh...oh...I think I might have figured it out:
>
> http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
>
> Check this:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep outlink | wc -l
> 169
>
> Hmmm...running test crawl right now with db.max.outlinks.per.page set to -1....
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 8:52 PM, Mattmann, Chris A (388J) wrote:
>
>> Here's a real use case too:
>>
>> ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>
>> That produces, as one of its outlinks:
>>
>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download
>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
>> [chipotle:local/nutch/framework] mattmann%
>>
>> That's correct. However, it doesn't seem like this outlink is being read at
>> least during the fetch/generate/crawl cycle, as
>> I never get it picked up in my crawl.
>> Nutch (and parse-tika) seem to parse
>> the URL just fine b/c if I run ParserChecker
>> direct to that URL, I see:
>>
>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>> contentType: application/pdf
>> ---------
>> Url
>> ---------------
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: Watergate Summary Part 01 of 02
>> Outlinks: 2
>> outlink: toUrl: Li:92 anchor:
>> outlink: toUrl: u92.:n. anchor:
>> Content Metadata: Date=Thu, 24 Nov 2011 04:49:42 GMT Content-Length=6354860
>> Expires=Thu, 01 Dec 2011 04:46:57 GMT Content-Disposition=attachment;
>> filename="watergat1summary.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
>> Connection=close Accept-Ranges=bytes Content-Type=application/pdf
>> Server=HTML Cache-Control=max-age=604800
>> Parse Metadata: xmpTPg:NPages=123 Creation-Date=2000-02-16T22:44:25Z
>> created=Wed Feb 16 14:44:25 PST 2000 Author=FBI producer=Acrobat PDFWriter
>> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT
>> Last-Modified=2011-11-08T01:41:01Z Content-Type=application/pdf creator=FBI
>> [chipotle:local/nutch/framework] mattmann%
>>
>> I'll keep digging. I wonder if it's a regex thing. I commented out
>> *everything* in my regex-urlfilter.txt besides:
>>
>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>
>> It seems to get EVERYTHING on the site *but* these dang at_download URLs.
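
One way to rule the regex in or out is to run the rule against both kinds of URL outside of Nutch. The Python sketch below does that. It assumes a "+" rule accepts a URL whenever its pattern is found in the URL (an unanchored search), which is an assumption about how regex-urlfilter.txt is applied rather than something shown in this thread; the helper name accepted is made up.

```python
import re

# The single rule left in regex-urlfilter.txt, without the leading "+".
# Note the unescaped dots in "vault.fbi.gov" match any character.
RULE = r"^http://([a-z0-9]*\.)*vault.fbi.gov/"


def accepted(url, pattern=RULE):
    """Assumed filter semantics: accept when the pattern is found in the URL."""
    return re.search(pattern, url) is not None


view = "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view"
pdf = ("http://vault.fbi.gov/watergate/"
       "watergate-summary-part-01-of-02/at_download/file")

print(accepted(view))                   # True
print(accepted(pdf))                    # True
print(accepted("http://example.com/"))  # False
```

Under that assumption the rule accepts the at_download URL just like the /view URL, so the filter line by itself would not explain the missing PDFs.
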
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 5:48 PM, Mattmann, Chris A (388J) wrote:
>>
>>> OK, it didn't work again: here are the URLs from a full crawl cycle:
>>>
>>> http://pastebin.com/Jx3Ar6Md
>>>
>>> When run independently, where I seed it with an *at_download* URL,
>>> direct to the PDF, it parses the PDF. But when I run it like normal with
>>> topN 10 and depth 10, it doesn't pick them up.
>>>
>>> /me stumped
>>>
>>> I'll poke around in the code but was just wondering if I was doing something
>>> wrong.
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:
>>>
>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>>>> how to make it work in 1.4 (instead of editing the global, top-level
>>>> conf/nutch-default.xml,
>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
>>>> forging ahead.
>>>>
>>>> I'll report back on if I'm able to grab the PDFs or not, using 1.4...
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>>>>
>>>>> *really* weird.
>>>>>
>>>>> With 1.4, even though I have my http.agent.name property set in
>>>>> conf/nutch-default.xml,
>>>>> it keeps telling me this:
>>>>>
>>>>> Fetcher: No agents listed in 'http.agent.name' property.
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>>>>> at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>>>> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>>>> at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>> [chipotle:local/nutch/framework] mattmann%
>>>>>
>>>>> When I try and crawl.
>>>>>
>>>>> Is nutch-default.xml not read by the crawl command in 1.4?
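
The fix reported in the 4:27 PM message above (editing runtime/local/conf/nutch-default.xml) suggests the launcher reads the copy under runtime/local/conf. A quick way to see which copy actually carries the property is to run the same emptiness check against the text of each file. This is a minimal Python sketch, assuming the usual Hadoop-style configuration/property/name/value layout; the function names and the agent name my-test-crawler are hypothetical, and the error text is copied from the trace above.

```python
import xml.etree.ElementTree as ET


def property_value(xml_text, name):
    """Return the value of the named property, or "" if missing or empty."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return ""


def check_agent(xml_text):
    """Mimic the fetcher's complaint when http.agent.name is empty."""
    agent = property_value(xml_text, "http.agent.name")
    if not agent:
        raise ValueError("No agents listed in 'http.agent.name' property.")
    return agent


# Two inline stand-ins for the two copies of the config file.
UNSET = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
</configuration>"""

SET = """<configuration>
  <property><name>http.agent.name</name><value>my-test-crawler</value></property>
</configuration>"""

print(check_agent(SET))  # my-test-crawler
try:
    check_agent(UNSET)
except ValueError as err:
    print(err)  # No agents listed in 'http.agent.name' property.
```

To check real files, read each one into a string and pass the text to check_agent.
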
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>>
>>>>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>>>>
>>>>>> Can you also try with trunk or 1.4? I get different output with
>>>>>> parsechecker such as a proper title.
>>>>>>
>>>>>>
>>>>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> contentType: application/pdf
>>>>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>>>>> ---------
>>>>>> Url
>>>>>> ---------------
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file---------
>>>>>> ParseData
>>>>>> ---------
>>>>>> Version: 5
>>>>>> Status: success(1,0)
>>>>>> Title: Watergate Summary Part 02 of 02
>>>>>> Outlinks: 0
>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493
>>>>>> Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment;
>>>>>> filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
>>>>>> Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
>>>>>> Cache-Control=max-age=604800
>>>>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z
>>>>>> created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter
>>>>>> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT
>>>>>> Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf
>>>>>> creator=FBI
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hey Markus,
>>>>>>>
>>>>>>> I set the http.content.limit to -1, so it shouldn't have a limit.
>>>>>>>
>>>>>>> I'll try injecting that single URL and see if I can get it to download
>>>>>>> using separate commands and see what happens!
>>>>>>> :-)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>>
>>>>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>>>>>>> Can you also check without merging segments? Or as a last resort, inject
>>>>>>>> that single URL in an empty crawl db and do a single crawl cycle,
>>>>>>>> preferably by using separate commands instead of the crawl command?
>>>>>>>>
>>>>>>>>> Hey Guys,
>>>>>>>>>
>>>>>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/
>>>>>>>>>
>>>>>>>>> My regex-url filter diff is:
>>>>>>>>>
>>>>>>>>> # accept anything else
>>>>>>>>> #+.
>>>>>>>>>
>>>>>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>>>>>
>>>>>>>>> I'm trying to get it to parse PDFs like:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>>
>>>>>>>>> I see that my config ParserChecker lets me parse it OK:
>>>>>>>>>
>>>>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>> contentType: application/pdf
>>>>>>>>> ---------
>>>>>>>>> Url
>>>>>>>>> ---------------
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file---------
>>>>>>>>> ParseData
>>>>>>>>> ---------
>>>>>>>>> Version: 5
>>>>>>>>> Status: success(1,0)
>>>>>>>>> Title:
>>>>>>>>> Outlinks: 0
>>>>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT
>>>>>>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT
>>>>>>>>> Content-Disposition=attachment; filename="watergat2.pdf"
>>>>>>>>> Last-Modified=Fri,
>>>>>>>>> 08 Jul 2011 17:46:08 GMT Connection=close
>>>>>>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
>>>>>>>>> Cache-Control=max-age=604800
>>>>>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>>>>>
>>>>>>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in
>>>>>>>>> terms of the plugin.includes (as it looks like parse-tika is included
>>>>>>>>> and handles * contentType).
>>>>>>>>>
>>>>>>>>> I see in my crawl log if I merge the segs, and dump them and then grep
>>>>>>>>> for URL, I see it getting to like:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>>>>
>>>>>>>>> That type of URL, but then not grabbing the PDF once it parses it, or
>>>>>>>>> adding it to the outlinks, as I never see a:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>>>>
>>>>>>>>> In the URL list.
>>>>>>>>>
>>>>>>>>> I'm running this command to crawl:
>>>>>>>>>
>>>>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>>>>
>>>>>>>>> Any idea what I'm doing wrong?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>> Chris Mattmann, Ph.D.
>>>>>>>>> Senior Computer Scientist
>>>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>>>> Office: 171-266B, Mailstop: 171-246
>>>>>>>>> Email: [email protected]
>>>>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

