Hey Guys,

Here is the latest red herring: I think I was using too small a -topN parameter in my crawl, which was limiting the whole fetch. I was using -depth 10 and -topN 10, which, thinking about it now, caps each generate round at 10 URLs, so the whole crawl tops out at roughly -topN * -depth = 100 pages. That's way too limited, since most pages have well over 100 outlinks. So parsing, regex, everything was working fine; it just wasn't following the links down, because the crawl had already hit that -topN * -depth budget.
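For concreteness, the fix is just to bump that per-round budget; the re-run is along these lines (the seed dir, output dir, and the exact -topN value here are placeholders, not my literal command):

./bin/nutch crawl urls -dir crawl -depth 10 -topN 100000
# with -depth 10 rounds and up to 100000 URLs generated per round, the whole-crawl
# ceiling is roughly depth * topN, instead of the 10 * 10 = 100 pages I had before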
I'm running a new crawl now and it seems to be getting a TON more URLs. Full crawls for me were limited to around ~5k URLs before, which I think was the problem. Fingers crossed!

Cheers,
Chris

On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:

> Hey Markus,
>
> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
>
>> Hi Chris
>>
>> https://issues.apache.org/jira/browse/NUTCH-1087
>
> Thanks for the pointer. I'll check it out.
>
>>
>> Use the org.apache.nutch.net.URLFilterChecker to test.
>
> Sweet, I didn't know about this tool. OK, I tried it out, check it (note that
> this includes my instrumented stuff, hence the printlns):
>
> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>
> So, it looks like it didn't match the first 2 rules, but matched the 3rd one,
> and thus it actually includes the URL fine. So, watch this, here are my 3
> relevant rules:
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>
> So, that makes perfect sense. RegexURLFilter appears to be working normally,
> so that's fine.
>
> So, .... what's the deal, then? ParserChecker works fine; it shows that an
> outlink from this URL:
>
> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>
> is in fact the at_download link:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
> [chipotle:local/nutch/framework] mattmann%
>
> RegexURLFilter takes in either of those URLs, and says they are fine:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> [chipotle:local/nutch/framework] mattmann%
>
> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> [chipotle:local/nutch/framework] mattmann%
>
> Any idea why I wouldn't be getting the at_download URLs downloaded, then?
> Here are http.content.limit and db.max.outlinks from my Nutch conf:
>
> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
> confuse this setting with the http.content.limit setting.
> </description>
> --
> <name>http.content.limit</name>
> <value>-1</value>
> --
> <name>db.max.outlinks.per.page</name>
> <value>-1</value>
> --
> If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
> will be processed for a page; otherwise, all outlinks will be processed.
> [chipotle:local/nutch/framework] mattmann%
>
> Cheers,
> Chris
>
>>
>>> Hey Markus,
>>>
>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>>> Hey Markus,
>>>>>
>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>>>>> command. I don't know what happens if it isn't there.
>>>>>
>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>>>>> conf directory in runtime/local/conf from 1.4?
>>>>
>>>> It's gone! I checked and last saw it in 1.2. Strange.
>>>>
>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if you
>>>>>> ask me.
>>>>>
>>>>> I'd be in favor of replacing the current Crawl command with a simple
>>>>> Java driver that just calls the underlying Inject, Generate, and Fetch
>>>>> tools. Would that work?
>>>>
>>>> There's an open issue to replace it with a basic crawl shell script. It's
>>>> easier to understand and uses the same commands. Non-Java users should
>>>> be able to deal with it better, and provide us with better problem
>>>> descriptions.
>>>
>>> +1, that would be cool indeed. Do you know which issue it is?
>>>
>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure out
>>> whether it's dropping the at_download URLs for whatever reason. Sigh.
>>>
>>> Cheers,
>>> Chris
>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>>> Hi Marek,
>>>>>>>
>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>>> I think when you use the crawl command instead of the single
>>>>>>>> commands, you have to specify the regEx rules in the
>>>>>>>> crawl-urlfilter.txt file. But I don't know if it is still the case
>>>>>>>> in 1.4.
>>>>>>>>
>>>>>>>> Could that be the problem?
>>>>>>>
>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
>>>>>>> Also it looks like urlfilter-regex is the one that's enabled by
>>>>>>> default and shipped with the basic config.
>>>>>>>
>>>>>>> Thanks for trying to help though. I'm going to figure this out! Or,
>>>>>>> someone is probably going to tell me what I'm doing wrong.
>>>>>>> We'll see what happens first :-)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>>
>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>>>>>>> global, top-level conf/nutch-default.xml, I needed to edit
>>>>>>>>>>> runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>>>>>>>>>
>>>>>>>>>> Yep, I think this is documented on the Wiki. It is partially why
>>>>>>>>>> I suggested that we deliver the content of runtime/local as our
>>>>>>>>>> binary release next time. Most people use Nutch in local mode,
>>>>>>>>>> so this would make their lives easier; as for the advanced users
>>>>>>>>>> (read: pseudo- or real-distributed), they need to recompile the
>>>>>>>>>> job file anyway and I'd expect them to use the src release.
>>>>>>>>>
>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>>>
>>>>>>>>> In the meanwhile, time to figure out why I still can't get it to
>>>>>>>>> crawl the PDFs... :(
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
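P.S. For anyone curious, the "basic crawl shell script" idea discussed in the thread above would essentially just chain the individual tools that the Crawl command wraps. A rough, untested sketch (the crawldb/segments paths, seed dir, DEPTH, and TOPN values are placeholders):

#!/bin/bash
# minimal crawl loop: inject once, then generate/fetch/parse/updatedb per round
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
DEPTH=10
TOPN=100000

./bin/nutch inject $CRAWLDB urls                        # seed the crawldb from the urls/ dir
for ((i=1; i<=DEPTH; i++)); do
  ./bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN   # select up to TOPN URLs for this round
  SEGMENT=$SEGMENTS/$(ls $SEGMENTS | sort | tail -1)    # newest (timestamp-named) segment
  ./bin/nutch fetch $SEGMENT                            # fetch the selected pages
  ./bin/nutch parse $SEGMENT                            # parse content and extract outlinks
  ./bin/nutch updatedb $CRAWLDB $SEGMENT                # fold new outlinks back into the crawldb
done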

