Hey Ken,

On Nov 25, 2011, at 7:58 AM, Ken Krugler wrote:
> From my experience with Nutch and now Bixo, I think it's important to support
> a -debug mode with tools that dumps out info about all decisions being made
> on URLs, as otherwise tracking down what's going wrong with a crawl
> (especially when doing test crawls) can be very painful.

+1, agreed. If you're like me you resort to inserting System.out.printlns
everywhere :-)

> I have no idea where Nutch stands in this regard as of today, but I would
> assume that it would be possible to generate information that would have
> answered all of the "is it X" questions that came up during Chris's crawl.
> E.g.
>
> - which URLs were put on the fetch list, versus skipped.
> - which fetched documents were truncated.
> - which URLs in a parsed page were skipped, due to the max outlinks per page
>   limit.
> - which URLs got filtered by regex

These are great requirements for a debug tool. I've created a page on the Wiki
for folks to contribute to/discuss:

http://wiki.apache.org/nutch/DebugTool
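To help seed that discussion, here's a very rough, untested sketch of the sort
of thing I have in mind: a wrapper filter that logs every accept/reject
decision. The package, class name, and delegate wiring below are made up for
illustration -- it only leans on the URLFilter contract where filter(url)
returns the URL to keep it, or null to drop it:

package org.apache.nutch.urlfilter.debug;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Wraps another URLFilter and logs every accept/reject decision it makes. */
public class LoggingURLFilter implements URLFilter {

  private static final Logger LOG =
      LoggerFactory.getLogger(LoggingURLFilter.class);

  private final URLFilter delegate;   // e.g. an instance of RegexURLFilter
  private Configuration conf;

  public LoggingURLFilter(URLFilter delegate) {
    this.delegate = delegate;
  }

  public String filter(String urlString) {
    // URLFilter contract: return the URL to keep it, null to reject it.
    String result = delegate.filter(urlString);
    LOG.info("{} {} url={}", new Object[] {
        delegate.getClass().getSimpleName(),
        result == null ? "REJECTED" : "ACCEPTED", urlString });
    return result;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    delegate.setConf(conf);
  }

  public Configuration getConf() {
    return conf;
  }
}

Hooking something like that into the filter/normalizer chains for a -debug run
is the part worth hashing out on the wiki page.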
Thanks, Ken!

Cheers,
Chris

> On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:
>
>> Hey Guys,
>>
>> Yep that was it. I had to use -topN 10000 -depth 10, and now I'm getting
>> all the at_download links.
>>
>> Phew! Who would have thought. Well glad Nutch is doing its thing, and
>> doing it correctly! :-)
>>
>> Thanks guys.
>>
>> Cheers,
>> Chris
>>
>> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
>>
>>> Hey Guys,
>>>
>>> Here is the latest red herring. I think I was using too small a -topN
>>> parameter in my crawl, which was limiting the whole fetch. I was using
>>> -depth 10 and -topN 10, which, thinking about it now, caps each generate
>>> round at 10 URLs, so at most ~100 pages over the whole crawl. That's far
>>> too few, since a single page can easily have more than 100 outlinks. So
>>> parsing, regex, everything was working fine; it just wasn't following the
>>> links down, because the crawl ran out of its -topN * -depth budget.
>>>
>>> I'm running a new crawl now and it seems to be getting a TON more URLs. Full
>>> crawls for me were limited to around ~5k URLs before, which I think was the
>>> problem. Fingers crossed!
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
>>>
>>>> Hey Markus,
>>>>
>>>> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
>>>>
>>>>> Hi Chris
>>>>>
>>>>> https://issues.apache.org/jira/browse/NUTCH-1087
>>>>
>>>> Thanks for the pointer. I'll check it out.
>>>>
>>>>> Use the org.apache.nutch.net.URLFilterChecker to test.
>>>>
>>>> Sweet, I didn't know about this tool. OK, I tried it out, check it (note
>>>> that this includes my instrumented stuff, hence the printlns):
>>>>
>>>> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>>>
>>>> So, it looks like it didn't match the first 2 rules, but matched the 3rd
>>>> one and thus it actually includes the URL fine. So, watch this, here are
>>>> my 3 relevant rules:
>>>>
>>>> # skip file: ftp: and mailto: urls
>>>> -^(file|ftp|mailto):
>>>>
>>>> # skip image and other suffixes we can't yet parse
>>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>>
>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>
>>>> So, that makes perfect sense. RegexURLFilter appears to be working
>>>> normally, so that's fine.
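Interjecting in my own quoted mail here, since this tripped me up: the
[false]/[false]/[true] above is just regex-urlfilter's first-match-wins
evaluation, at least as I understand it -- each rule's pattern is tried in
order, a leading '-' rejects, a leading '+' accepts, and a URL that matches
nothing is rejected. A tiny standalone illustration (plain java.util.regex,
nothing Nutch-specific, and the suffix rule abbreviated to keep the line
short):

import java.util.regex.Pattern;

public class RegexRuleCheck {

  public static void main(String[] args) {
    String url = "http://vault.fbi.gov/watergate/"
        + "watergate-summary-part-01-of-02/at_download/file";

    // the three rules above, in order: "-" = reject, "+" = accept
    String[][] rules = {
      { "-", "^(file|ftp|mailto):" },
      { "-", "\\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|exe|EXE|js|JS)$" }, // abbreviated
      { "+", "^http://([a-z0-9]*\\.)*vault.fbi.gov/" },
    };

    for (String[] rule : rules) {
      // RegexURLFilter uses find() semantics: a partial match anywhere counts
      boolean matched = Pattern.compile(rule[1]).matcher(url).find();
      System.out.println("rule " + rule[0] + rule[1] + " matched? " + matched);
      if (matched) {
        System.out.println((rule[0].equals("+") ? "ACCEPT " : "REJECT ") + url);
        return;
      }
    }
    System.out.println("no rule matched, REJECT " + url);
  }
}

Running that prints matched? false, false, true and then ACCEPT for the
at_download URL, which lines up with the checker output.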
>>>> So, .... what's the deal, then? ParserChecker works fine, it shows that an
>>>> outlink from this URL:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>>>
>>>> is in fact the at_download link:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
>>>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> RegexURLFilter takes in either of those URLs, and says they are fine:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> Any idea why I wouldn't be getting the at_download URLs downloaded, then?
>>>> Here's http.content.limit and db.max.outlinks from my Nutch conf:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
>>>> confuse this setting with the http.content.limit setting.
>>>> </description>
>>>> --
>>>> <name>http.content.limit</name>
>>>> <value>-1</value>
>>>> --
>>>> <name>db.max.outlinks.per.page</name>
>>>> <value>-1</value>
>>>> --
>>>> If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>>>> will be processed for a page; otherwise, all outlinks will be processed.
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>>> Hey Markus,
>>>>>>
>>>>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>>>>>> Hey Markus,
>>>>>>>>
>>>>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>>>>>>>> command. I don't know what happens if it isn't there.
>>>>>>>>
>>>>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>>>>>>>> conf directory in runtime/local/conf from 1.4?
>>>>>>>
>>>>>>> It's gone! I checked and last saw it in 1.2. Strange.
>>>>>>>
>>>>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if you
>>>>>>>>> ask me.
>>>>>>>>
>>>>>>>> I'd be in favor of replacing the current Crawl command with a simple
>>>>>>>> Java driver that just calls the underlying Inject, Generate, and Fetch
>>>>>>>> tools. Would that work?
>>>>>>>
>>>>>>> There's an open issue to replace it with a basic crawl shell script. It's
>>>>>>> easier to understand and uses the same commands. Non-Java users should
>>>>>>> be able to deal with it better, and provide us with better problem
>>>>>>> descriptions.
>>>>>>
>>>>>> +1, that would be cool indeed. Do you know what issue it is?
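For what it's worth, something along these lines is roughly what I meant by a
simple Java driver -- completely untested, the class name is made up, and the
per-tool argument forms are from memory, so take it as a sketch rather than
working code:

package org.apache.nutch.crawl;

import java.io.File;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

/** Rough sketch of a crawl driver that just chains the individual tools. */
public class SimpleCrawlDriver {

  public static void main(String[] args) throws Exception {
    String urlDir = "urls";               // seed list directory
    String crawlDb = "crawl/crawldb";
    String segmentsDir = "crawl/segments";
    int depth = 10;
    String topN = "10000";

    Configuration conf = NutchConfiguration.create();

    // inject the seed URLs into the crawldb
    ToolRunner.run(conf, new Injector(), new String[] { crawlDb, urlDir });

    for (int i = 0; i < depth; i++) {
      // generate a fetch list; this creates a new timestamped segment
      ToolRunner.run(conf, new Generator(),
          new String[] { crawlDb, segmentsDir, "-topN", topN });

      // pick the newest segment the generator just created
      File[] segs = new File(segmentsDir).listFiles();
      Arrays.sort(segs);
      String segment = segs[segs.length - 1].getPath();

      // fetch, parse (assumes fetcher.parse=false), then fold the results
      // back into the crawldb
      ToolRunner.run(conf, new Fetcher(), new String[] { segment });
      ToolRunner.run(conf, new ParseSegment(), new String[] { segment });
      ToolRunner.run(conf, new CrawlDb(), new String[] { crawlDb, segment });
    }
  }
}

The real Crawl command also does invertlinks and (optionally) indexing, and a
bash version of the same loop would read about the same, which is probably why
the shell-script route is attractive.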
>>>>>>
>>>>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure
>>>>>> out if it's dropping the at_download URLs for whatever reason. Sigh.
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>>>>>> Hi Marek,
>>>>>>>>>>
>>>>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>>>>>> I think when you use the crawl command instead of the single
>>>>>>>>>>> commands, you have to specify the regEx rules in the
>>>>>>>>>>> crawl-urlfilter.txt file. But I don't know if it is still the case
>>>>>>>>>>> in 1.4.
>>>>>>>>>>>
>>>>>>>>>>> Could that be the problem?
>>>>>>>>>>
>>>>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
>>>>>>>>>> Also it looks like urlfilter-regex is the one that's enabled by
>>>>>>>>>> default and shipped with the basic config.
>>>>>>>>>>
>>>>>>>>>> Thanks for trying to help though. I'm going to figure this out! Or
>>>>>>>>>> someone is probably going to tell me what I'm doing wrong.
>>>>>>>>>> We'll see what happens first :-)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Chris
>>>>>>>>>>
>>>>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>>>>>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>>>>>>>>>> global, top-level conf/nutch-default.xml,
>>>>>>>>>>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling
>>>>>>>>>>>>>> is forging ahead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> yep, I think this is documented on the Wiki. It is partially why
>>>>>>>>>>>>> I suggested that we deliver the content of runtime/local as our
>>>>>>>>>>>>> binary release for next time. Most people use Nutch in local
>>>>>>>>>>>>> mode, so this would make their lives easier; as for the advanced
>>>>>>>>>>>>> users (read: pseudo or real distributed), they need to recompile
>>>>>>>>>>>>> the job file anyway and I'd expect them to use the src release.
>>>>>>>>>>>>
>>>>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>>>>>>
>>>>>>>>>>>> In the meanwhile, time to figure out why I still can't get it to
>>>>>>>>>>>> crawl the PDFs... :(
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Chris
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

