Hey Guys,

Yep, that was it. I had to use -topN 10000 -depth 10, and now I'm getting all the at_download links.
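
(For the record, the invocation I'm using now looks roughly like this; "urls" and "crawl" are just my local seed and output directory names:)

bin/nutch crawl urls -dir crawl -depth 10 -topN 10000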

Phew! Who would have thought. Well, glad Nutch is doing its thing, and doing it correctly! :-) Thanks guys.

Cheers,
Chris

On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:

> Hey Guys,
>
> Here is the latest red herring. I think I was using too small a -topN
> parameter in my crawl, which was limiting the whole fetch. I was using
> -depth 10 and -topN 10, which, thinking about it now, caps each generate
> round at 10 URLs, so the whole crawl tops out around -topN * -depth = 100
> pages. That's far too limited, since most pages here have more than 100
> outlinks. So parsing, regex, everything was working fine; the crawl just
> wasn't following the links down, because it exceeded -topN * -depth.
>
> I'm running a new crawl now and it seems to be getting a TON more URLs.
> Full crawls for me were limited to around ~5k URLs before, which I think
> was the problem. Fingers crossed!
>
> Cheers,
> Chris
>
> On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
>
>> Hey Markus,
>>
>> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
>>
>>> Hi Chris
>>>
>>> https://issues.apache.org/jira/browse/NUTCH-1087
>>
>> Thanks for the pointer. I'll check it out.
>>
>>> Use the org.apache.nutch.net.URLFilterChecker to test.
>>
>> Sweet, I didn't know about this tool. OK, I tried it out; check it out
>> (note that this includes my instrumented stuff, hence the printlns):
>>
>> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" \
>>   | ./bin/nutch org.apache.nutch.net.URLFilterChecker \
>>     -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>
>> So, it looks like it didn't match the first 2 rules but matched the 3rd
>> one, and thus it actually includes the URL fine. So, watch this, here
>> are my 3 relevant rules:
>>
>> # skip file: ftp: and mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> # for a more extensive coverage use the urlfilter-suffix plugin
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>
>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>
>> So, that makes perfect sense: the rules are evaluated top to bottom and
>> the first match wins, so the at_download URL falls through the two
>> exclusion rules to the third, inclusive one. RegexURLFilter appears to
>> be working normally, so that's fine.
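>>
>> (That first-match-wins walk can be roughly approximated with a few
>> lines of bash, assuming the active rules are in the default
>> conf/regex-urlfilter.txt and are grep -E compatible:)
>>
>> url="http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file"
>> # strip comments and blank lines, then test each +/- rule in order
>> grep -v '^#' conf/regex-urlfilter.txt | grep -v '^$' | while read -r rule; do
>>   sign=${rule:0:1}; pattern=${rule:1}
>>   if printf '%s\n' "$url" | grep -qE "$pattern"; then
>>     echo "$sign $url"    # '+' means accepted, '-' means filtered out
>>     break                # first matching rule decides, as in RegexURLFilter
>>   fi
>> done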
>>
>> So... what's the deal, then? ParserChecker works fine; it shows that an
>> outlink from this URL:
>>
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>
>> is in fact the at_download link:
>>
>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
>> [chipotle:local/nutch/framework] mattmann%
>>
>> RegexURLFilter takes in either of those URLs and says they are fine:
>>
>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>> [chipotle:local/nutch/framework] mattmann%
>>
>> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>> [chipotle:local/nutch/framework] mattmann%
>>
>> Any idea why I wouldn't be getting the at_download URLs downloaded,
>> then? Here are http.content.limit and db.max.outlinks.per.page from my
>> Nutch conf:
>>
>> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
>> confuse this setting with the http.content.limit setting.
>> </description>
>> --
>> <name>http.content.limit</name>
>> <value>-1</value>
>> --
>> <name>db.max.outlinks.per.page</name>
>> <value>-1</value>
>> --
>> If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>> will be processed for a page; otherwise, all outlinks will be processed.
>> [chipotle:local/nutch/framework] mattmann%
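>>
>> (So both are already -1, i.e. unlimited, here. For what it's worth, I
>> believe the usual place for these overrides is conf/nutch-site.xml
>> rather than nutch-default.xml, roughly like this:)
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <!-- -1 removes the cap on downloaded content size, so big PDFs
>>        aren't truncated before parsing -->
>>   <value>-1</value>
>> </property>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <!-- a negative value means every outlink on a page is processed -->
>>   <value>-1</value>
>> </property>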
>>
>> Cheers,
>> Chris
>>
>>>> Hey Markus,
>>>>
>>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>>
>>>>>> Hey Markus,
>>>>>>
>>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>>>
>>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>>>>>> command. I don't know what happens if it isn't there.
>>>>>>
>>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>>>>>> conf directory in runtime/local/conf from 1.4?
>>>>>
>>>>> It's gone! I checked and last saw it in 1.2. Strange.
>>>>>
>>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5, if
>>>>>>> you ask me.
>>>>>>
>>>>>> I'd be in favor of replacing the current Crawl command with a simple
>>>>>> Java driver that just calls the underlying Inject, Generate, and
>>>>>> Fetch tools. Would that work?
>>>>>
>>>>> There's an open issue to replace it with a basic crawl shell script.
>>>>> It's easier to understand and uses the same commands. Non-Java users
>>>>> should be able to deal with it better, and provide us with better
>>>>> problem descriptions.
>>>>
>>>> +1, that would be cool indeed. Do you know what issue it is?
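>>>>
>>>> Something like this is what I'd picture; just an untested sketch
>>>> ("urls" and "crawl" are made-up directory names, and there's no error
>>>> handling):
>>>>
>>>> #!/bin/sh
>>>> # minimal stand-in for the Crawl command: inject once, then
>>>> # generate/fetch/parse/updatedb for a fixed number of rounds
>>>> DEPTH=10
>>>> bin/nutch inject crawl/crawldb urls
>>>> for round in $(seq 1 "$DEPTH"); do
>>>>   bin/nutch generate crawl/crawldb crawl/segments -topN 10000
>>>>   # newest timestamped segment is the one we just generated
>>>>   SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
>>>>   bin/nutch fetch $SEGMENT
>>>>   bin/nutch parse $SEGMENT
>>>>   bin/nutch updatedb crawl/crawldb $SEGMENT
>>>> done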
>>>>
>>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can
>>>> figure out whether it's dropping the at_download URLs for whatever
>>>> reason. Sigh.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>>>> Hi Marek,
>>>>>>>>
>>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>>>
>>>>>>>>> I think when you use the crawl command instead of the single
>>>>>>>>> commands, you have to specify the regex rules in the
>>>>>>>>> crawl-urlfilter.txt file. But I don't know if that is still the
>>>>>>>>> case in 1.4.
>>>>>>>>>
>>>>>>>>> Could that be the problem?
>>>>>>>>
>>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
>>>>>>>> Also, it looks like urlfilter-regex is the one that's enabled by
>>>>>>>> default and shipped with the basic config.
>>>>>>>>
>>>>>>>> Thanks for trying to help, though. I'm going to figure this out!
>>>>>>>> Or someone is probably going to tell me what I'm doing wrong.
>>>>>>>> We'll see what happens first :-)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>>>> OK, never mind. This *is* different behavior from 1.3
>>>>>>>>>>>> apparently, but I figured out how to make it work in 1.4
>>>>>>>>>>>> (instead of editing the global, top-level
>>>>>>>>>>>> conf/nutch-default.xml, I needed to edit
>>>>>>>>>>>> runtime/local/conf/nutch-default.xml). Crawling is forging
>>>>>>>>>>>> ahead.
>>>>>>>>>>>
>>>>>>>>>>> Yep, I think this is documented on the wiki. It is partially
>>>>>>>>>>> why I suggested that we deliver the content of runtime/local
>>>>>>>>>>> as our binary release next time. Most people use Nutch in
>>>>>>>>>>> local mode, so this would make their lives easier; the
>>>>>>>>>>> advanced users (read: pseudo- or real distributed) need to
>>>>>>>>>>> recompile the job file anyway, and I'd expect them to use the
>>>>>>>>>>> src release.
>>>>>>>>>>
>>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>>>>
>>>>>>>>>> In the meanwhile, time to figure out why I still can't get it
>>>>>>>>>> to crawl the PDFs... :(
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

