Uh...oh...I think I might have figured it out: http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
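
That FAQ entry is about db.max.outlinks.per.page: by default only the first
100 outlinks per page are kept, and these view pages have more than that
(see the count below). A minimal override sketch, assuming local overrides
live in conf/nutch-site.xml (which wins over nutch-default.xml):

<property>
  <name>db.max.outlinks.per.page</name>
  <!-- -1 = keep all outlinks; the default of 100 silently drops the rest -->
  <value>-1</value>
</property>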

Check this:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep outlink | wc -l
169

Hmmm...running test crawl right now with db.max.outlinks.per.page set to -1....

Cheers,
Chris

On Nov 23, 2011, at 8:52 PM, Mattmann, Chris A (388J) wrote:

> Here's a real use case too:
>
> ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>
> That produces, as one of its outlinks:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download
> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
> [chipotle:local/nutch/framework] mattmann%
>
> That's correct. However, it doesn't seem like this outlink is being read, at least during the fetch/generate/crawl cycle, as I never get it picked up in my crawl. Nutch (and parse-tika) seem to parse the URL just fine b/c if I run ParserChecker direct to that URL, I see:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> contentType: application/pdf
> ---------
> Url
> ---------------
> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Watergate Summary Part 01 of 02
> Outlinks: 2
>   outlink: toUrl: Li:92 anchor:
>   outlink: toUrl: u92.:n. anchor:
> Content Metadata: Date=Thu, 24 Nov 2011 04:49:42 GMT Content-Length=6354860 Expires=Thu, 01 Dec 2011 04:46:57 GMT Content-Disposition=attachment; filename="watergat1summary.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
> Parse Metadata: xmpTPg:NPages=123 Creation-Date=2000-02-16T22:44:25Z created=Wed Feb 16 14:44:25 PST 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T01:41:01Z Content-Type=application/pdf creator=FBI
> [chipotle:local/nutch/framework] mattmann%
>
> I'll keep digging. I wonder if it's a regex thing. I commented out *everything* in my regex-urlfilter.txt besides:
>
> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>
> It seems to get EVERYTHING on the site *but* these dang at_download URLs.
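>
> To rule the filters out, a quick sanity check (a sketch; if I recall, URLFilterChecker reads URLs on stdin and prints each back prefixed with + or - depending on whether the active filters accept it):
>
> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | \
>   ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined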
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 5:48 PM, Mattmann, Chris A (388J) wrote:
>
>> OK, it didn't work again: here are the URLs from a full crawl cycle:
>>
>> http://pastebin.com/Jx3Ar6Md
>>
>> When run independently, where I seed it with an *at_download* URL, direct to the PDF, it parses the PDF. But when I run it like normal with topN 10 and depth 10, it doesn't pick them up.
>>
>> /me stumped
>>
>> I'll poke around in the code but was just wondering if I was doing something wrong.
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:
>>
>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>>
>>> I'll report back on whether I'm able to grab the PDFs or not, using 1.4...
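>>>
>>> For the record, the cleaner fix is probably a runtime/local/conf/nutch-site.xml override rather than editing nutch-default.xml in place, since nutch-site.xml takes precedence. A sketch, with a made-up agent string:
>>>
>>> <property>
>>>   <name>http.agent.name</name>
>>>   <!-- any non-empty name satisfies the Fetcher's check -->
>>>   <value>vault-test-crawler</value>
>>> </property>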
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>>>
>>>> *really* weird.
>>>>
>>>> With 1.4, even though I have my http.agent.name property set in conf/nutch-default.xml, it keeps telling me this when I try and crawl:
>>>>
>>>> Fetcher: No agents listed in 'http.agent.name' property.
>>>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>>>>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> Is nutch-default.xml not read by the crawl command in 1.4?
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>>>
>>>>> Can you also try with trunk or 1.4? I get different output with parsechecker, such as a proper title.
>>>>>
>>>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> contentType: application/pdf
>>>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>>>> ---------
>>>>> Url
>>>>> ---------------
>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> ---------
>>>>> ParseData
>>>>> ---------
>>>>> Version: 5
>>>>> Status: success(1,0)
>>>>> Title: Watergate Summary Part 02 of 02
>>>>> Outlinks: 0
>>>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI
>>>>>
>>>>>> Hey Markus,
>>>>>>
>>>>>> I set the http.content.limit to -1, so it shouldn't have a limit.
>>>>>>
>>>>>> I'll try injecting that single URL and see if I can get it to download using separate commands and see what happens! :-)
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>>>>
>>>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file? Can you also check without merging segments? Or as a last resort, inject that single URL in an empty crawl db and do a single crawl cycle, preferably by using separate commands instead of the crawl command?
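>>>>>>>
>>>>>>> Roughly this, as a sketch (the crawldb/segments paths are just examples):
>>>>>>>
>>>>>>> bin/nutch inject crawldb urls
>>>>>>> bin/nutch generate crawldb segments -topN 1
>>>>>>> s=`ls -d segments/2* | tail -1`
>>>>>>> bin/nutch fetch $s
>>>>>>> bin/nutch parse $s
>>>>>>> bin/nutch updatedb crawldb $s
>>>>>>> bin/nutch readdb crawldb -stats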
>>>>>>>
>>>>>>>> Hey Guys,
>>>>>>>>
>>>>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/
>>>>>>>>
>>>>>>>> My regex-urlfilter.txt diff is:
>>>>>>>>
>>>>>>>> # accept anything else
>>>>>>>> #+.
>>>>>>>>
>>>>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>>>>
>>>>>>>> I'm trying to get it to parse PDFs like:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>
>>>>>>>> I see that, with my config, ParserChecker lets me parse it OK:
>>>>>>>>
>>>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> contentType: application/pdf
>>>>>>>> ---------
>>>>>>>> Url
>>>>>>>> ---------------
>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> ---------
>>>>>>>> ParseData
>>>>>>>> ---------
>>>>>>>> Version: 5
>>>>>>>> Status: success(1,0)
>>>>>>>> Title:
>>>>>>>> Outlinks: 0
>>>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>>>>
>>>>>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms of the plugin.includes, as it looks like parse-tika is included and handles the * contentType.
>>>>>>>>
>>>>>>>> If I merge the segs from my crawl, dump them, and then grep for the URL, I see it getting to URLs like:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>>>
>>>>>>>> but then not grabbing the PDF once it parses that page, or adding it to the outlinks, as I never see a:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>>>
>>>>>>>> in the URL list.
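>>>>>>>>
>>>>>>>> The merge/dump/grep step looks roughly like this (a sketch; the merged and dumpdir paths are just examples):
>>>>>>>>
>>>>>>>> ./runtime/local/bin/nutch mergesegs crawl/merged -dir crawl/segments
>>>>>>>> ./runtime/local/bin/nutch readseg -dump crawl/merged/* crawl/dumpdir
>>>>>>>> grep at_download crawl/dumpdir/dump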
>>>>>>>>
>>>>>>>> I'm running this command to crawl:
>>>>>>>>
>>>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>>>
>>>>>>>> Any idea what I'm doing wrong?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

