One comment, from watching this email thread at a distance. From my experience with Nutch and now Bixo, I think it's important for the tools to support a -debug mode that dumps out info about every decision being made on a URL, as otherwise tracking down what's going wrong with a crawl (especially when doing test crawls) can be very painful.
I have no idea where Nutch stands in this regard as of today, but I would assume that it would be possible to generate information that would have answered all of the "is it X" questions that came up during Chris's crawl, e.g.:

- which URLs were put on the fetch list, versus skipped
- which fetched documents were truncated
- which URLs in a parsed page were skipped, due to the max outlinks per page limit
- which URLs got filtered by regex

and so on.

-- Ken

On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:

> Hey Guys,
>
> Yep that was it. I had to use -topN 10000 -depth 10, and now I'm getting all the at_download links.
>
> Phew! Who would have thought. Well, glad Nutch is doing its thing, and doing it correctly! :-)
>
> Thanks guys.
>
> Cheers,
> Chris
>
> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
>
>> Hey Guys,
>>
>> Here is the latest red herring. I think I was using too small a -topN parameter in my crawl, which was limiting the whole fetch. I was using -depth 10 and -topN 10, which thinking about it now was limiting the whole crawl to about 100 pages (-topN * -depth), which was too limited since most pages include > 100 outlinks and so forth. So parsing, regex, everything was working fine; it just wasn't following the links down, because the crawl exceeded -topN * -depth.
>>
>> I'm running a new crawl now and it seems to be getting a TON more URLs. Full crawls for me were limited to around ~5k URLs before, which I think was the problem. Fingers crossed!
>>
>> Cheers,
>> Chris
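A minimal sketch of the kind of invocation being described above; the seed directory (urls) and output directory (crawl) names are placeholders, not taken from the thread:

  # each generate/fetch round selects at most -topN URLs and -depth sets the number
  # of rounds, so the whole crawl is bounded by roughly depth * topN pages;
  # -depth 10 -topN 10 caps it near 100 pages, while -topN 10000 raises that bound to 100,000
  ./bin/nutch crawl urls -dir crawl -depth 10 -topN 10000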
>> On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
>>
>>> Hey Markus,
>>>
>>> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
>>>
>>>> Hi Chris
>>>>
>>>> https://issues.apache.org/jira/browse/NUTCH-1087
>>>
>>> Thanks for the pointer. I'll check it out.
>>>
>>>> Use the org.apache.nutch.net.URLFilterChecker to test.
>>>
>>> Sweet, I didn't know about this tool. OK, I tried it out, check it (note that this includes my instrumented stuff, hence the printlns):
>>>
>>> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>>
>>> So, it looks like the URL didn't match the first 2 rules, but matched the 3rd one, and thus the filter actually includes it fine. So, watch this, here are my 3 relevant rules:
>>>
>>> # skip file: ftp: and mailto: urls
>>> -^(file|ftp|mailto):
>>>
>>> # skip image and other suffixes we can't yet parse
>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>
>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>
>>> So, that makes perfect sense. RegexURLFilter appears to be working normally, so that's fine.
>>>
>>> So, .... what's the deal, then? ParserChecker works fine; it shows that an outlink from this URL:
>>>
>>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>>
>>> is in fact the at_download link:
>>>
>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
>>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> RegexURLFilter takes in either of those URLs, and says they are fine:
>>>
>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> Any idea why I wouldn't be getting the at_download URLs downloaded, then? Here are http.content.limit and db.max.outlinks from my Nutch conf:
>>>
>>> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
>>> confuse this setting with the http.content.limit setting.
>>> </description>
>>> --
>>> <name>http.content.limit</name>
>>> <value>-1</value>
>>> --
>>> <name>db.max.outlinks.per.page</name>
>>> <value>-1</value>
>>> --
>>> If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> Cheers,
>>> Chris
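As comes out further down the thread, in 1.4 local mode it is the copies of these files under runtime/local/conf that the crawl actually reads, not the top-level conf/ directory, so the same sanity check is only meaningful against the runtime copy. A sketch of that check, using the same property names as above:

  # inspect the config the local crawl actually uses; the top-level conf/ is only the
  # source that gets copied into runtime/local/conf when the runtime is built
  egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" runtime/local/conf/nutch-*.xml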
>>>>
>>>>> Hey Markus,
>>>>>
>>>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>>>
>>>>>>> Hey Markus,
>>>>>>>
>>>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>>>>
>>>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl command. I don't know what happens if it isn't there.
>>>>>>>
>>>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf directory in runtime/local/conf from 1.4?
>>>>>>
>>>>>> It's gone! I checked and last saw it in 1.2. Strange.
>>>>>>
>>>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if you ask me.
>>>>>>>
>>>>>>> I'd be in favor of replacing the current Crawl command with a simple Java driver that just calls the underlying Inject, Generate, and Fetch tools. Would that work?
>>>>>>
>>>>>> There's an open issue to replace it with a basic crawl shell script. It's easier to understand and uses the same commands. Non-Java users should be able to deal with it better, and provide us with better problem descriptions.
>>>>>
>>>>> +1, that would be cool indeed. Do you know what issue it is?
>>>>>
>>>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure out if it's dropping the at_download URLs for whatever reason. Sigh.
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
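A minimal sketch of what such a crawl script could look like, chaining the standard bin/nutch sub-commands; the directory names, depth, and topN values here are placeholders and error handling is omitted:

  #!/bin/bash
  # skeleton of a crawl loop in the spirit of the issue mentioned above:
  # inject the seeds once, then run generate/fetch/parse/updatedb rounds
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments
  SEEDS=urls
  DEPTH=10
  TOPN=10000

  ./bin/nutch inject $CRAWLDB $SEEDS
  for ((i = 1; i <= DEPTH; i++)); do
    ./bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN
    segment=$(ls -d $SEGMENTS/* | tail -1)   # newest segment, i.e. this round's
    ./bin/nutch fetch $segment
    ./bin/nutch parse $segment
    ./bin/nutch updatedb $CRAWLDB $segment
  done

This mirrors the generate/fetch/parse/updatedb loop that the crawl command runs internally, minus the link inversion and indexing steps.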
>>>>>>>
>>>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>>>>> Hi Marek,
>>>>>>>>>
>>>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>>>>
>>>>>>>>>> I think when you use the crawl command instead of the single commands, you have to specify the regEx rules in the crawl-urlfilter.txt file. But I don't know if it is still the case in 1.4.
>>>>>>>>>>
>>>>>>>>>> Could that be the problem?
>>>>>>>>>
>>>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also it looks like urlfilter-regex is the one that's enabled by default and shipped with the basic config.
>>>>>>>>>
>>>>>>>>> Thanks for trying to help though. I'm going to figure this out! Or, someone is going to probably tell me what I'm doing wrong. We'll see what happens first :-)
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>>>>>>>>>>>
>>>>>>>>>>>> yep, I think this is documented on the Wiki. It is partially why I suggested that we deliver the content of runtime/local as our binary release for next time. Most people use Nutch in local mode so this would make their lives easier; as for the advanced users (read pseudo or real distributed), they need to recompile the job file anyway and I'd expect them to use the src release.
>>>>>>>>>>>
>>>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>>>>>
>>>>>>>>>>> In the meanwhile, time to figure out why I still can't get it to crawl the PDFs... :(
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Chris
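One practical footnote on the runtime/local point above, offered as a sketch (the ant target name is an assumption from memory of the 1.4 build, not something stated in the thread): edits to the top-level conf/ only take effect in local mode once the runtime directory is rebuilt.

  # rebuild runtime/local (and runtime/deploy) after editing the top-level conf/,
  # or edit the copies under runtime/local/conf directly, as described above
  ant runtime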
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

