Quite an episode indeed! Is it possible to further this on the Dev list?
New thread, focused subject... relative outcomes (fingers crossed) :0)

On Fri, Nov 25, 2011 at 4:49 PM, Markus Jelsma <[email protected]> wrote:

> On Friday 25 November 2011 17:36:36 Mattmann, Chris A (388J) wrote:
>> Hey Ken,
>>
>> On Nov 25, 2011, at 7:58 AM, Ken Krugler wrote:
>>> From my experience with Nutch and now Bixo, I think it's important for
>>> the tools to support a -debug mode that dumps out info about all
>>> decisions being made on URLs, as otherwise tracking down what's going
>>> wrong with a crawl (especially when doing test crawls) can be very
>>> painful.
>>
>> +1, agreed. If you're like me you resort to inserting
>> System.out.printlns everywhere :-)
>
> I do that all the time when producing code. Setting the log level to
> debug is too overwhelming in some cases.
>
>>> I have no idea where Nutch stands in this regard as of today, but I
>>> would assume that it would be possible to generate information that
>>> would have answered all of the "is it X" questions that came up during
>>> Chris's crawl. E.g.:
>>>
>>> - which URLs were put on the fetch list, versus skipped
>>> - which fetched documents were truncated
>>> - which URLs in a parsed page were skipped, due to the max outlinks
>>>   per page limit
>>> - which URLs got filtered by regex
>>
>> These are great requirements for a debug tool. I've created a page on
>> the Wiki for folks to contribute to/discuss:
>>
>> http://wiki.apache.org/nutch/DebugTool
>>
>> Thanks, Ken!
>>
>> Cheers,
>> Chris

On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:

Hey Guys,

Yep, that was it. I had to use -topN 10000 -depth 10, and now I'm getting
all the at_download links.

Phew! Who would have thought. Well, glad Nutch is doing its thing, and
doing it correctly! :-)

Thanks guys.

Cheers,
Chris

On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:

Hey Guys,

Here is the latest red herring. I think I was using too small a -topN
parameter in my crawl, which was limiting the whole fetch. I was using
-depth 10 and -topN 10 which, thinking about it now, capped the crawl at
roughly 100 pages in total. That is far too limited, since most pages have
well over 100 outlinks. So parsing, regex, everything was working fine; it
just wasn't following the links down, because the crawl exceeded
-topN * -depth.

I'm running a new crawl now and it seems to be getting a TON more URLs.
Full crawls for me were limited to around ~5k URLs before, which I think
was the problem. Fingers crossed!

Cheers,
Chris
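For reference, the -topN and -depth knobs discussed above belong to the
one-shot crawl command. A minimal sketch of the kind of invocation being
described, with illustrative seed-list and output directory names, is:

  ./bin/nutch crawl urls -dir crawl -depth 10 -topN 10000

Since -topN caps how many URLs the generator selects per round and -depth
caps the number of rounds, a crawl run with -depth 10 -topN 10 can never
fetch more than about 10 * 10 = 100 pages in total, which is exactly the
ceiling described above.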
On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:

Hey Markus,

On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
> Hi Chris
>
> https://issues.apache.org/jira/browse/NUTCH-1087

Thanks for the pointer. I'll check it out.

> Use the org.apache.nutch.net.URLFilterChecker to test.

Sweet, I didn't know about this tool. OK, I tried it out; check it out
(note that this includes my instrumented stuff, hence the printlns):

echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file

So it looks like the URL didn't match the first 2 rules, but matched the
3rd one, and thus the filter actually includes the URL fine. So, watch
this, here are my 3 relevant rules:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

+^http://([a-z0-9]*\.)*vault.fbi.gov/

So, that makes perfect sense. RegexURLFilter appears to be working
normally, so that's fine.

So... what's the deal, then? ParserChecker works fine; it shows that an
outlink from this URL:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

is in fact the at_download link:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
[chipotle:local/nutch/framework] mattmann%

RegexURLFilter takes in either of those URLs and says they are fine:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
[chipotle:local/nutch/framework] mattmann%

[chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
[chipotle:local/nutch/framework] mattmann%

Any idea why I wouldn't be getting the at_download URLs downloaded, then?
Here are http.content.limit and db.max.outlinks from my Nutch conf:

[chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
confuse this setting with the http.content.limit setting.
</description>
--
<name>http.content.limit</name>
<value>-1</value>
--
<name>db.max.outlinks.per.page</name>
<value>-1</value>
--
If this value is nonnegative (>=0), at most db.max.outlinks.per.page
outlinks will be processed for a page; otherwise, all outlinks will be
processed.
[chipotle:local/nutch/framework] mattmann%

Cheers,
Chris
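Two quick checks can help answer that question. This is a sketch assuming
a local crawl directory named crawl/ (adjust the paths to your setup): the
first asks the crawldb what it knows about the URL, and the second, if
your copy of URLFilterChecker supports the -allCombined option, runs the
URL through every enabled filter rather than just urlfilter-regex:

  ./bin/nutch readdb crawl/crawldb -stats
  ./bin/nutch readdb crawl/crawldb -url http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
  echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

If the URL is missing from the crawldb entirely, it was never injected or
discovered; if it is present but still listed as db_unfetched, it was
discovered but never selected by the generator (for example because of the
-topN cap), which is what turned out to be happening here.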
Hey Markus,

On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>> Hey Markus,
>>
>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>> command. I don't know what happens if it isn't there.
>>
>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>> conf directory in runtime/local/conf from 1.4?
>
> It's gone! I checked and last saw it in 1.2. Strange.
>
>>> Good reasons to get rid of the crawl command and stuff in 1.5, if you
>>> ask me.
>>
>> I'd be in favor of replacing the current Crawl command with a simple
>> Java driver that just calls the underlying Inject, Generate, and Fetch
>> tools. Would that work?
>
> There's an open issue to replace it with a basic crawl shell script.
> It's easier to understand and uses the same commands. Non-Java users
> should be able to deal with it better, and provide us with better
> problem descriptions.

+1, that would be cool indeed. Do you know what issue it is?

BTW, I'm currently instrumenting urlfilter-regex to see if I can figure
out whether it's dropping the at_download URLs for whatever reason. Sigh.

Cheers,
Chris
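For what it's worth, a bare-bones version of such a script would just
chain the existing tools. Roughly, as a sketch (directory names and the
number of rounds are illustrative, and all error handling is omitted):

  #!/bin/sh
  # seed the crawldb from the urls/ seed directory
  ./bin/nutch inject crawl/crawldb urls
  for round in 1 2 3 4 5; do
    # select the top-scoring unfetched URLs for this round
    ./bin/nutch generate crawl/crawldb crawl/segments -topN 10000
    # newest segment sorts last because segment names are timestamps
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    ./bin/nutch fetch $SEGMENT
    ./bin/nutch parse $SEGMENT
    # fold fetch results and newly discovered outlinks back into the crawldb
    ./bin/nutch updatedb crawl/crawldb $SEGMENT
  done

Running those same commands by hand is also the easiest way to see exactly
which step is dropping a URL.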
On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:

Hi Marek,

On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> I think when you use the crawl command instead of the single commands,
> you have to specify the regex rules in the crawl-urlfilter.txt file. But
> I don't know if that is still the case in 1.4.
>
> Could that be the problem?

Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also, it
looks like urlfilter-regex is the one that's enabled by default and
shipped with the basic config.

Thanks for trying to help though. I'm going to figure this out! Or,
someone is probably going to tell me what I'm doing wrong. We'll see what
happens first :-)

Cheers,
Chris

On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:

On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured
>> out how to make it work in 1.4 (instead of editing the global,
>> top-level conf/nutch-default.xml, I needed to edit
>> runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>
> Yep, I think this is documented on the Wiki. It is partially why I
> suggested that we deliver the content of runtime/local as our binary
> release next time. Most people use Nutch in local mode, so this would
> make their lives easier; advanced users (read: pseudo or real
> distributed) need to recompile the job file anyway, and I'd expect them
> to use the src release.

+1, I'll be happy to edit build.xml and make that happen for 1.5.

In the meanwhile, time to figure out why I still can't get it to crawl
the PDFs... :(

Cheers,
Chris
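A note on the runtime/local/conf point above, since it is easy to trip
over: in a 1.4 source checkout the local runtime reads its configuration
from runtime/local/conf, and that directory is repopulated from the
top-level conf/ when the runtime is rebuilt. A sketch, assuming the stock
build.xml targets; by convention, overrides go in nutch-site.xml rather
than nutch-default.xml:

  # after editing the top-level conf/ files, rebuild the local runtime
  ant runtime
  # ...or edit the copy that the local runtime actually reads
  vi runtime/local/conf/nutch-site.xml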
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
Markus Jelsma - CTO - Openindex

--
*Lewis*

