Hey Markus,

On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:

> > Hey Markus,
> >
> > On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> > > I think Marek is right, the crawl-filter _is_ used in the crawl command.
> > > I don't know what happens if it isn't there.
> >
> > Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf
> > directory in runtime/local/conf from 1.4?
>
> It's gone! I checked and last saw it in 1.2. Strange
>
> > > Good reasons to get rid of the crawl command and stuff in 1.5 if you ask
> > > me.
> >
> > I'd be in favor of replacing the current Crawl command with a simple Java
> > driver that just calls the underlying Inject, Generate, and Fetch tools.
> > Would that work?
>
> There's an open issue to replace it with a basic crawl shell script. It's
> easier to understand and uses the same commands. Non-Java users should be
> able to deal with it better, and provide us with better problem descriptions.

+1, that would be cool indeed. Do you know what issue it is?

BTW, I'm currently instrumenting urlfilter-regex to see if I can figure out
whether it's dropping the at_download URLs for whatever reason. Sigh.

Cheers,
Chris

> >
> > Cheers,
> > Chris
> >
> > > On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> > >> Hi Marek,
> > >>
> > >> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> > >>> I think when you use the crawl command instead of the single commands,
> > >>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
> > >>> But I don't know if it is still the case in 1.4
> > >>>
> > >>> Could that be the problem?
> > >>
> > >> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
> > >> it looks like urlfilter-regex is the one that's enabled by default
> > >> and shipped with the basic config.
> > >>
> > >> Thanks for trying to help though. I'm going to figure this out! Or,
> > >> someone is going to probably tell me what I'm doing wrong.
> > >> We'll see what happens first :-)
> > >>
> > >> Cheers,
> > >> Chris
> > >>
> > >>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> > >>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> > >>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
> > >>>>>> figured out how to make it work in 1.4 (instead of editing the
> > >>>>>> global, top-level conf/nutch-default.xml,
> > >>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
> > >>>>>> forging ahead.
> > >>>>>
> > >>>>> yep, I think this is documented on the Wiki. It is partially why I
> > >>>>> suggested that we deliver the content of runtime/local as our binary
> > >>>>> release for next time. Most people use Nutch in local mode so this
> > >>>>> would make their lives easier, as for the advanced users (read pseudo
> > >>>>> or real distributed) they need to recompile the job file anyway and
> > >>>>> I'd expect them to use the src release
> > >>>>
> > >>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
> > >>>>
> > >>>> In the meanwhile, time to figure out why I still can't get it to crawl
> > >>>> the PDFs... :(
> > >>>>
> > >>>> Cheers,
> > >>>> Chris
> > >>>>
> > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>>> Chris Mattmann, Ph.D.
> > >>>> Senior Computer Scientist
> > >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > >>>> Office: 171-266B, Mailstop: 171-246
> > >>>> Email: [email protected]
> > >>>> WWW: http://sunset.usc.edu/~mattmann/
> > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>>> Adjunct Assistant Professor, Computer Science Department
> > >>>> University of Southern California, Los Angeles, CA 90089 USA
> > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

