Hey Markus,

On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:

> Hi Chris
>
> https://issues.apache.org/jira/browse/NUTCH-1087

Thanks for the pointer. I'll check it out.

>
> Use the org.apache.nutch.net.URLFilterChecker to test.

Sweet, I didn't know about this tool. OK, I tried it out, check it (note that this
includes my instrumented stuff, hence the printlns):

echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file

So, looks like it didn't match the first 2 rules, but matched the 3rd one and thus
it actually includes the URL fine. So, watch this, here are my 3 relevant rules:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

+^http://([a-z0-9]*\.)*vault.fbi.gov/

So, that makes perfect sense. RegexURLFilter appears to be working normally, so that's fine.

So, .... what's the deal, then?
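For reference, here is a minimal sketch of the first-match-wins evaluation that the three rules and the checker output above suggest. It is written in Python rather than Nutch's Java, so it is an illustration and not the plugin's code. The names RULES and filter_url are hypothetical. It assumes a rule matches anywhere in the URL unless the pattern is anchored (re.search stands in for the filter's matching), and that a URL matching no rule is rejected.

```python
import re

# The three rules quoted above, as (sign, pattern) pairs.
# '+' means accept the URL, '-' means reject it.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("+", r"^http://([a-z0-9]*\.)*vault.fbi.gov/"),
]


def filter_url(url, rules=RULES):
    """Return url if the first matching rule is '+', otherwise None.

    Assumption: rules are tried in file order and the first match decides;
    a URL that matches no rule at all is treated as rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return url if sign == "+" else None
    return None


if __name__ == "__main__":
    link = ("http://vault.fbi.gov/watergate/"
            "watergate-summary-part-01-of-02/at_download/file")
    print(filter_url(link))
```

Under these assumptions the at_download link falls through both reject rules and is accepted by the third, which is consistent with the false, false, true sequence in the instrumented output.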
ParserChecker works fine, it shows that an outlink from this URL:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

Is in fact the at_download link:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
[chipotle:local/nutch/framework] mattmann%

RegexURLFilter takes in either of those URLs, and says they are fine:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
[chipotle:local/nutch/framework] mattmann%

[chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
[chipotle:local/nutch/framework] mattmann%

Any idea why I wouldn't be getting the at_download URLs downloaded then?

Here's http.content.limit, db.max.outlinks from my Nutch conf:

[chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
confuse this setting with the http.content.limit setting.
</description>
--
<name>http.content.limit</name>
<value>-1</value>
--
<name>db.max.outlinks.per.page</name>
<value>-1</value>
--
If this value is nonnegative (>=0), at most db.max.outlinks.per.page
outlinks will be processed for a page; otherwise, all outlinks will be processed.
[chipotle:local/nutch/framework] mattmann%

Cheers,
Chris

>
>> Hey Markus,
>>
>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>> Hey Markus,
>>>>
>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>>>> command. I don't know what happens if it isn't there.
>>>>
>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>>>> conf directory in runtime/local/conf from 1.4?
>>>
>>> It's gone! I checked and last saw it in 1.2. Strange
>>>
>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if you
>>>>> ask me.
>>>>
>>>> I'd be in favor of replacing the current Crawl command with a simple
>>>> Java driver that just calls the underlying Inject, Generate, and Fetch
>>>> tools. Would that work?
>>>
>>> There's an open issue to replace it with a basic crawl shell script. It's
>>> easier to understand and uses the same commands. Non-Java users should
>>> be able to deal with it better, and provide us with better problem
>>> descriptions.
>>
>> +1, that would be cool indeed. Do you know what issue it is?
>>
>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure out
>> if it's dropping the at_download URLs for whatever reason. Sigh.
>>
>> Cheers,
>> Chris
>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>> Hi Marek,
>>>>>>
>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>> I think when you use the crawl command instead of the single
>>>>>>> commands, you have to specify the regEx rules in the
>>>>>>> crawl-urlfilter.txt file. But I don't know if it is still the case
>>>>>>> in 1.4
>>>>>>>
>>>>>>> Could that be the problem?
>>>>>>
>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
>>>>>> Also it looks like urlfilter-regex is the one that's enabled by
>>>>>> default and shipped with the basic config.
>>>>>>
>>>>>> Thanks for trying to help though. I'm going to figure this out! Or,
>>>>>> someone is going to probably tell me what I'm doing wrong.
>>>>>> We'll see what happens first :-)
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>>>>>> global, top-level conf/nutch-default.xml,
>>>>>>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling
>>>>>>>>>> is forging ahead.
>>>>>>>>>
>>>>>>>>> yep, I think this is documented on the Wiki. It is partially why
>>>>>>>>> I suggested that we deliver the content of runtime/local as our
>>>>>>>>> binary release for next time.
>>>>>>>>> Most people use Nutch in local
>>>>>>>>> mode so this would make their lives easier, as for the advanced
>>>>>>>>> users (read pseudo or real distributed) they need to recompile
>>>>>>>>> the job file anyway and I'd expect them to use the src release
>>>>>>>>
>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>>
>>>>>>>> In the meanwhile, time to figure out why I still can't get it to
>>>>>>>> crawl the PDFs... :(
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Chris Mattmann, Ph.D.
>>>>>>>> Senior Computer Scientist
>>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>>> Office: 171-266B, Mailstop: 171-246
>>>>>>>> Email: [email protected]
>>>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

