On Friday 25 November 2011 17:36:36 Mattmann, Chris A (388J) wrote:
> Hey Ken,
>
> On Nov 25, 2011, at 7:58 AM, Ken Krugler wrote:
> > From my experience with Nutch and now Bixo, I think it's important to
> > support a -debug mode with tools that dump out info about all decisions
> > being made on URLs, as otherwise tracking down what's going wrong with a
> > crawl (especially when doing test crawls) can be very painful.
>
> +1, agreed. If you're like me you resort to inserting System.out.printlns
> everywhere :-)

I do that all the time when producing code. Setting the log level to debug is
too overwhelming in some cases.
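(As an illustration of the kind of per-URL decision logging Ken describes, here
is a minimal, self-contained Java sketch. The URLFilter interface and the chain
class below are hypothetical stand-ins for illustration only, not Nutch's
actual plugin API.)

import java.util.LinkedHashMap;
import java.util.Map;

public class DebugURLFilterChain {

    // Stand-in for a URL filter: return the URL to accept it, null to reject it.
    // This is an illustration only, not the real Nutch URLFilter plugin interface.
    interface URLFilter {
        String filter(String url);
    }

    private final Map<String, URLFilter> filters = new LinkedHashMap<>();
    private final boolean debug;

    DebugURLFilterChain(boolean debug) {
        this.debug = debug;
    }

    void add(String name, URLFilter filter) {
        filters.put(name, filter);
    }

    // Run every filter in order; in debug mode, report exactly which filter
    // rejected a URL instead of silently dropping it.
    String filter(String url) {
        for (Map.Entry<String, URLFilter> entry : filters.entrySet()) {
            String result = entry.getValue().filter(url);
            if (result == null) {
                if (debug) System.out.println("REJECTED by " + entry.getKey() + ": " + url);
                return null;
            }
            url = result;
        }
        if (debug) System.out.println("ACCEPTED: " + url);
        return url;
    }

    public static void main(String[] args) {
        DebugURLFilterChain chain = new DebugURLFilterChain(true);
        chain.add("skip-image-suffixes", u -> u.matches(".*\\.(gif|jpg|png)$") ? null : u);
        chain.add("vault-only", u -> u.startsWith("http://vault.fbi.gov/") ? u : null);

        chain.filter("http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file");
        chain.filter("http://vault.fbi.gov/images/logo.gif");
    }
}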
> > I have no idea where Nutch stands in this regard as of today, but I would
> > assume that it would be possible to generate information that would have
> > answered all of the "is it X" questions that came up during Chris's
> > crawl. E.g.
> >
> > - which URLs were put on the fetch list, versus skipped.
> > - which fetched documents were truncated.
> > - which URLs in a parsed page were skipped, due to the max outlinks per
> >   page limit.
> > - which URLs got filtered by regex.
>
> These are great requirements for a debug tool. I've created a page on the
> Wiki for folks to contribute to/discuss:
>
> http://wiki.apache.org/nutch/DebugTool
>
> Thanks, Ken!
>
> Cheers,
> Chris
>
> > On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:
> >> Hey Guys,
> >>
> >> Yep, that was it. I had to use -topN 10000 -depth 10, and now I'm getting
> >> all the at_download links.
> >>
> >> Phew! Who would have thought. Well, glad Nutch is doing its thing, and
> >> doing it correctly! :-)
> >>
> >> Thanks guys.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
> >>> Hey Guys,
> >>>
> >>> Here is the latest red herring. I think I was using too small a -topN
> >>> parameter in my crawl, which was limiting the whole fetch. I was using
> >>> -depth 10 and -topN 10, which, thinking about it now, was limiting the
> >>> crawl to 100 pages total across all depth levels. That was too limited,
> >>> I think, since most pages have more than 100 outlinks. So parsing,
> >>> regex, everything was working fine; it just wasn't following the links
> >>> down, because the crawl exceeded -topN * -depth.
> >>>
> >>> I'm running a new crawl now and it seems to be getting a TON more URLs.
> >>> Full crawls for me were limited to around ~5k URLs before, which I
> >>> think was the problem. Fingers crossed!
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>> On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
> >>>> Hey Markus,
> >>>>
> >>>> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
> >>>>> Hi Chris
> >>>>>
> >>>>> https://issues.apache.org/jira/browse/NUTCH-1087
> >>>>
> >>>> Thanks for the pointer. I'll check it out.
> >>>>
> >>>>> Use the org.apache.nutch.net.URLFilterChecker to test.
> >>>>
> >>>> Sweet, I didn't know about this tool. OK, I tried it out, check it out
> >>>> (note that this includes my instrumented stuff, hence the printlns):
> >>>>
> >>>> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >>>>
> >>>> So, it looks like the URL didn't match the first 2 rules, but matched
> >>>> the 3rd one, and thus the filter actually includes the URL fine.
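(As an aside, the -topN * -depth arithmetic Chris describes above can be made
concrete with a rough, hypothetical sketch of a crawl driver loop, of the kind
discussed further down the thread: each generate round selects at most topN
URLs and there are at most depth rounds, so the number of fetched pages is
bounded by topN * depth no matter how many outlinks each page yields. The class
and step logic below are placeholders for illustration, not Nutch's actual
Inject/Generate/Fetch tools.)

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CrawlLoopSketch {

    public static void main(String[] args) {
        int topN = 10;   // max URLs selected per generate round
        int depth = 10;  // number of generate/fetch rounds

        Set<String> crawlDb = new LinkedHashSet<>();
        crawlDb.add("http://vault.fbi.gov/");          // "inject": seed URL(s)

        Set<String> fetched = new LinkedHashSet<>();
        for (int round = 0; round < depth; round++) {
            // "generate": pick at most topN not-yet-fetched URLs from the crawl db
            List<String> segment = new ArrayList<>();
            for (String url : crawlDb) {
                if (segment.size() >= topN) break;
                if (!fetched.contains(url)) segment.add(url);
            }
            if (segment.isEmpty()) break;

            // "fetch" + "parse" + "updatedb": mark as fetched, feed outlinks back in
            for (String url : segment) {
                fetched.add(url);
                crawlDb.addAll(discoverOutlinks(url));
            }
        }

        // No matter how many outlinks each page contributes, the number of
        // fetched pages is bounded by topN * depth (here 10 * 10 = 100).
        System.out.println("fetched " + fetched.size() + " of " + crawlDb.size()
                + " known URLs (bound: " + (topN * depth) + ")");
    }

    // Placeholder for parsing a fetched page and URL-filtering its outlinks.
    static List<String> discoverOutlinks(String url) {
        return List.of(url + "a", url + "b", url + "c");
    }
}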
> >>>> So, watch this, here are my 3 relevant rules:
> >>>>
> >>>> # skip file: ftp: and mailto: urls
> >>>> -^(file|ftp|mailto):
> >>>>
> >>>> # skip image and other suffixes we can't yet parse
> >>>> # for a more extensive coverage use the urlfilter-suffix plugin
> >>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>>>
> >>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
> >>>>
> >>>> So, that makes perfect sense. RegexURLFilter appears to be working
> >>>> normally, so that's fine.
> >>>>
> >>>> So, .... what's the deal, then? ParserChecker works fine; it shows
> >>>> that an outlink from this URL:
> >>>>
> >>>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >>>>
> >>>> is in fact the at_download link:
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
> >>>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> RegexURLFilter takes in either of those URLs and says they are fine:
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> Any idea why I wouldn't be getting the at_download URLs downloaded, then?
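(For reference, rule evaluation in this style of filter is first-match-wins:
each regex is tried in order, a leading "+" accepts and a leading "-" rejects,
and a URL that matches no rule is dropped. The at_download URL ends in /file,
so it gets past the suffix rule and is accepted by the vault.fbi.gov rule,
which is consistent with the output above. Below is a minimal, self-contained
Java sketch of that behavior, not the actual RegexURLFilter code, with a
trimmed-down suffix list.)

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexRuleSketch {

    // One "+regex" or "-regex" line from the filter file.
    static class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // The first matching rule decides; a URL that matches no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("-^(file|ftp|mailto):"));
        rules.add(new Rule("-\\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|exe|EXE)$"));
        rules.add(new Rule("+^http://([a-z0-9]*\\.)*vault.fbi.gov/"));

        // Ends in /file, so the suffix rule does not match and the
        // vault.fbi.gov rule accepts it: prints true.
        System.out.println(accepts(rules,
            "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file"));

        // Hits the image-suffix rule first: prints false.
        System.out.println(accepts(rules, "http://vault.fbi.gov/images/logo.gif"));
    }
}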
> >>>> Here's http.content.limit, db.max.outlinks from my Nutch conf:
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
> >>>> confuse this setting with the http.content.limit setting.
> >>>> </description>
> >>>> --
> >>>> <name>http.content.limit</name>
> >>>> <value>-1</value>
> >>>> --
> >>>> <name>db.max.outlinks.per.page</name>
> >>>> <value>-1</value>
> >>>> --
> >>>> If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> >>>> outlinks will be processed for a page; otherwise, all outlinks will
> >>>> be processed.
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>>>> Hey Markus,
> >>>>>>
> >>>>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
> >>>>>>>> Hey Markus,
> >>>>>>>>
> >>>>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> >>>>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
> >>>>>>>>> command. I don't know what happens if it isn't there.
> >>>>>>>>
> >>>>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my
> >>>>>>>> built conf directory in runtime/local/conf from 1.4?
> >>>>>>>
> >>>>>>> It's gone! I checked and last saw it in 1.2. Strange.
> >>>>>>>
> >>>>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if
> >>>>>>>>> you ask me.
> >>>>>>>>
> >>>>>>>> I'd be in favor of replacing the current Crawl command with a
> >>>>>>>> simple Java driver that just calls the underlying Inject,
> >>>>>>>> Generate, and Fetch tools. Would that work?
> >>>>>>>
> >>>>>>> There's an open issue to replace it with a basic crawl shell
> >>>>>>> script. It's easier to understand and uses the same commands.
> >>>>>>> Non-Java users should be able to deal with it better, and provide
> >>>>>>> us with better problem descriptions.
> >>>>>>
> >>>>>> +1, that would be cool indeed. Do you know which issue it is?
> >>>>>>
> >>>>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can
> >>>>>> figure out whether it's dropping the at_download URLs for whatever
> >>>>>> reason. Sigh.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Chris
> >>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Chris
> >>>>>>>>
> >>>>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> >>>>>>>>>> Hi Marek,
> >>>>>>>>>>
> >>>>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> >>>>>>>>>>> I think when you use the crawl command instead of the single
> >>>>>>>>>>> commands, you have to specify the regex rules in the
> >>>>>>>>>>> crawl-urlfilter.txt file. But I don't know if that is still
> >>>>>>>>>>> the case in 1.4.
> >>>>>>>>>>>
> >>>>>>>>>>> Could that be the problem?
> >>>>>>>>>>
> >>>>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
> >>>>>>>>>> Also, it looks like urlfilter-regex is the one that's enabled by
> >>>>>>>>>> default and shipped with the basic config.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for trying to help, though. I'm going to figure this out!
> >>>>>>>>>> Or someone is probably going to tell me what I'm doing wrong.
> >>>>>>>>>> We'll see what happens first :-)
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Chris
> >>>>>>>>>>
> >>>>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >>>>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>>>>>>>>>>>>> OK, nm.
> >>>>>>>>>>>>>> This *is* different behavior from 1.3 apparently, but I
> >>>>>>>>>>>>>> figured out how to make it work in 1.4 (instead of editing
> >>>>>>>>>>>>>> the global, top-level conf/nutch-default.xml, I needed to
> >>>>>>>>>>>>>> edit runtime/local/conf/nutch-default.xml). Crawling is
> >>>>>>>>>>>>>> forging ahead.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yep, I think this is documented on the Wiki. It is partially
> >>>>>>>>>>>>> why I suggested that we deliver the content of runtime/local
> >>>>>>>>>>>>> as our binary release next time. Most people use Nutch in
> >>>>>>>>>>>>> local mode, so this would make their lives easier; as for the
> >>>>>>>>>>>>> advanced users (read: pseudo or real distributed), they need
> >>>>>>>>>>>>> to recompile the job file anyway and I'd expect them to use
> >>>>>>>>>>>>> the src release.
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for
> >>>>>>>>>>>> 1.5.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In the meanwhile, time to figure out why I still can't get it
> >>>>>>>>>>>> to crawl the PDFs... :(
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Chris

> > --------------------------
> > Ken Krugler
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Mahout & Solr

--
Markus Jelsma - CTO - Openindex

