> Hey Markus,
>
> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> > I think Marek is right, the crawl-filter _is_ used in the crawl command.
> > I don't know what happens if it isn't there.
>
> Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf
> directory in runtime/local/conf from 1.4?

It's gone! I checked and last saw it in 1.2. Strange.

> > Good reasons to get rid of the crawl command and stuff in 1.5 if you ask
> > me.
>
> I'd be in favor of replacing the current Crawl command with a simple Java
> driver that just calls the underlying Inject, Generate, and Fetch tools.
> Would that work?

There's an open issue to replace it with a basic crawl shell script. It's
easier to understand and uses the same commands. Non-Java users should be
able to deal with it better, and provide us with better problem
descriptions.

> Cheers,
> Chris
>
> > On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> >> Hi Marek,
> >>
> >> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> >>> I think when you use the crawl command instead of the single commands,
> >>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
> >>> But I don't know if it is still the case in 1.4
> >>>
> >>> Could that be the problem?
> >>
> >> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
> >> it looks like urlfilter-regex is the one that's enabled by default
> >> and shipped with the basic config.
> >>
> >> Thanks for trying to help though. I'm going to figure this out! Or,
> >> someone is going to probably tell me what I'm doing wrong.
> >> We'll see what happens first :-)
> >>
> >> Cheers,
> >> Chris
> >>
> >>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
> >>>>>> figured out how to make it work in 1.4 (instead of editing the
> >>>>>> global, top-level conf/nutch-default.xml,
> >>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
> >>>>>> forging ahead.
> >>>>>
> >>>>> yep, I think this is documented on the Wiki. It is partially why I
> >>>>> suggested that we deliver the content of runtime/local as our binary
> >>>>> release for next time.
> >>>>> Most people use Nutch in local mode so this
> >>>>> would make their lives easier, as for the advanced users (read pseudo
> >>>>> or real distributed) they need to recompile the job file anyway and
> >>>>> I'd expect them to use the src release
> >>>>
> >>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
> >>>>
> >>>> In the meanwhile, time to figure out why I still can't get it to crawl
> >>>> the PDFs... :(
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Senior Computer Scientist
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 171-266B, Mailstop: 171-246
> >>>> Email: [email protected]
> >>>> WWW: http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Assistant Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

