Right! I've been pulling my hair out on similar occasions! It's another argument for getting rid of the crawl command: the -depth parameter makes little sense in the long run, in my opinion, since there is no real depth information.
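For what it's worth, a step-by-step cycle in local mode looks roughly like this (only a sketch: the urls/crawldb/segments paths and the -topN value are made-up examples, and link inversion and indexing are left out):

  # put the seed URLs into the crawldb
  bin/nutch inject crawl/crawldb urls
  # one round: select up to -topN URLs, fetch them, parse them, update the crawldb
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT   # only needed when fetcher.parse is false
  bin/nutch updatedb crawl/crawldb $SEGMENT
  # repeat the generate/fetch/parse/updatedb round as often as you want

There is no -depth anywhere; you simply run as many rounds as you need.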
I would recommend that you, and anyone else, learn to use the separate commands, as Lewis wrote in the latest tutorial.

On Friday 25 November 2011 16:49:43 Mattmann, Chris A (388J) wrote:
> Hey Guys,
>
> Yep that was it. I had to use -topN 10000 -depth 10, and now I'm getting
> all the at_download links.
>
> Phew! Who would have thought. Well, glad Nutch is doing its thing, and
> doing it correctly! :-)
>
> Thanks guys.
>
> Cheers,
> Chris
>
> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
> > Hey Guys,
> >
> > Here is the latest red herring. I think I was using too small a -topN
> > parameter in my crawl, which was limiting the whole fetch. I was using
> > -depth 10 and -topN 10, which, thinking about it now, was limiting the
> > crawl to 100 pages across all depth levels. That was too low, since most
> > pages have more than 100 outlinks. So parsing, regex, everything was
> > working fine; it just wasn't following the links down, because the crawl
> > exceeded -topN * -depth.
> >
> > I'm running a new crawl now and it seems to be getting a TON more URLs.
> > Full crawls for me were limited to around ~5k URLs before, which I think
> > was the problem. Fingers crossed!
> >
> > Cheers,
> > Chris
> >
> > On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
> >> Hey Markus,
> >>
> >> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
> >>> Hi Chris
> >>>
> >>> https://issues.apache.org/jira/browse/NUTCH-1087
> >>
> >> Thanks for the pointer. I'll check it out.
> >>
> >>> Use the org.apache.nutch.net.URLFilterChecker to test.
> >>
> >> Sweet, I didn't know about this tool. OK, I tried it out, check it (note
> >> that this includes my instrumented stuff, hence the printlns):
> >>
> >> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >>
> >> So it looks like it didn't match the first 2 rules, but matched the 3rd
> >> one, and thus it actually includes the URL fine. So, watch this, here
> >> are my 3 relevant rules:
> >>
> >> # skip file: ftp: and mailto: urls
> >> -^(file|ftp|mailto):
> >>
> >> # skip image and other suffixes we can't yet parse
> >> # for a more extensive coverage use the urlfilter-suffix plugin
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>
> >> +^http://([a-z0-9]*\.)*vault.fbi.gov/
> >>
> >> So, that makes perfect sense. RegexURLFilter appears to be working
> >> normally, so that's fine.
> >>
> >> So, .... what's the deal, then?
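A side note on the numbers earlier in this thread, since that turned out to be the real culprit: each -depth iteration of the crawl command is one generate/fetch/update round, and generate selects at most -topN URLs per round, so a whole crawl is bounded by roughly depth * topN fetched pages. Purely as an illustration (the seed and output directories are just examples):

  bin/nutch crawl urls -dir crawl -depth 10 -topN 10      # bounded by ~100 pages in total
  bin/nutch crawl urls -dir crawl -depth 10 -topN 10000   # bounded by ~100,000 pages

That is why raising -topN, and not any filter change, made the at_download links show up.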
> >> ParserChecker works fine; it shows that an outlink from this URL:
> >>
> >> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >>
> >> is in fact the at_download link:
> >>
> >> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
> >> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> RegexURLFilter takes in either of those URLs, and says they are fine:
> >>
> >> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> Any idea why I wouldn't be getting the at_download URLs downloaded, then?
> >> Here are http.content.limit and db.max.outlinks.per.page from my Nutch conf:
> >>
> >> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
> >> confuse this setting with the http.content.limit setting.
> >> </description>
> >> --
> >> <name>http.content.limit</name>
> >> <value>-1</value>
> >> --
> >> <name>db.max.outlinks.per.page</name>
> >> <value>-1</value>
> >> --
> >> If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> >> outlinks will be processed for a page; otherwise, all outlinks will be
> >> processed.
> >> [chipotle:local/nutch/framework] mattmann%
> >>
> >> Cheers,
> >> Chris
> >>
> >>>> Hey Markus,
> >>>>
> >>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
> >>>>>> Hey Markus,
> >>>>>>
> >>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> >>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
> >>>>>>> command. I don't know what happens if it isn't there.
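By the way, the checker reads any number of URLs from stdin, so you can sanity-check a whole seed list against your rules in one go, for example (the seed file name is just an example):

  cat urls/seed.txt | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter

If I remember correctly there is also an -allCombined option that runs all activated filter plugins at once instead of a single one by name.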
> >>>>>>
> >>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
> >>>>>> conf directory in runtime/local/conf from 1.4?
> >>>>>
> >>>>> It's gone! I checked and last saw it in 1.2. Strange.
> >>>>>
> >>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if
> >>>>>>> you ask me.
> >>>>>>
> >>>>>> I'd be in favor of replacing the current Crawl command with a simple
> >>>>>> Java driver that just calls the underlying Inject, Generate, and
> >>>>>> Fetch tools. Would that work?
> >>>>>
> >>>>> There's an open issue to replace it with a basic crawl shell script.
> >>>>> It's easier to understand and uses the same commands. Non-Java users
> >>>>> should be able to deal with it better, and provide us with better
> >>>>> problem descriptions.
> >>>>
> >>>> +1, that would be cool indeed. Do you know which issue it is?
> >>>>
> >>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure
> >>>> out whether it's dropping the at_download URLs for whatever reason. Sigh.
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>>>> Cheers,
> >>>>>> Chris
> >>>>>>
> >>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> >>>>>>>> Hi Marek,
> >>>>>>>>
> >>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> >>>>>>>>> I think when you use the crawl command instead of the single
> >>>>>>>>> commands, you have to specify the regex rules in the
> >>>>>>>>> crawl-urlfilter.txt file. But I don't know if that is still the
> >>>>>>>>> case in 1.4.
> >>>>>>>>>
> >>>>>>>>> Could that be the problem?
> >>>>>>>>
> >>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
> >>>>>>>> Also it looks like urlfilter-regex is the one that's enabled by
> >>>>>>>> default and shipped with the basic config.
> >>>>>>>>
> >>>>>>>> Thanks for trying to help though. I'm going to figure this out!
> >>>>>>>> Or, someone is probably going to tell me what I'm doing wrong.
> >>>>>>>> We'll see what happens first :-)
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Chris
> >>>>>>>>
> >>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but
> >>>>>>>>>>>> I figured out how to make it work in 1.4 (instead of editing
> >>>>>>>>>>>> the global, top-level conf/nutch-default.xml,
> >>>>>>>>>>>> I needed to edit runtime/local/conf/nutch-default.xml).
> >>>>>>>>>>>> Crawling is forging ahead.
> >>>>>>>>>>>
> >>>>>>>>>>> Yep, I think this is documented on the Wiki. It is partially
> >>>>>>>>>>> why I suggested that we deliver the content of runtime/local
> >>>>>>>>>>> as our binary release next time. Most people use Nutch in
> >>>>>>>>>>> local mode, so this would make their lives easier; as for the
> >>>>>>>>>>> advanced users (read: pseudo or real distributed), they need to
> >>>>>>>>>>> recompile the job file anyway and I'd expect them to use the
> >>>>>>>>>>> src release.
> >>>>>>>>>>
> >>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for
> >>>>>>>>>> 1.5.
> >>>>>>>>>>
> >>>>>>>>>> In the meanwhile, time to figure out why I still can't get it to
> >>>>>>>>>> crawl the PDFs... :(
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Chris
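One last remark on the configuration editing discussed above: rather than editing nutch-default.xml (whether the top-level copy or the one under runtime/local/conf), it is cleaner to put overrides in nutch-site.xml, which is loaded on top of the defaults. A sketch, assuming you run from runtime/local and want the same values that came up in this thread:

cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- do not truncate fetched content -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- process every outlink found on a page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
EOF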
--
Markus Jelsma - CTO - Openindex

