Hey Markus,

On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:

> I think Marek is right, the crawl-filter _is_ used in the crawl command. I
> don't know what happens if it isn't there.

Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf
directory in runtime/local/conf from 1.4?

>
> Good reasons to get rid of the crawl command and stuff in 1.5 if you ask me.

I'd be in favor of replacing the current Crawl command with a simple Java
driver that just calls the underlying Inject, Generate, and Fetch tools.
Would that work?

Cheers,
Chris

>
> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>> Hi Marek,
>>
>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>> I think when you use the crawl command instead of the single commands,
>>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
>>> But I don't know if it is still the case in 1.4
>>>
>>> Could that be the problem?
>>
>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
>> it looks like urlfilter-regex is the one that's enabled by default
>> and shipped with the basic config.
>>
>> Thanks for trying to help though. I'm going to figure this out! Or,
>> someone is going to probably tell me what I'm doing wrong.
>> We'll see what happens first :-)
>>
>> Cheers,
>> Chris
>>
>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>> global, top-level conf/nutch-default.xml,
>>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
>>>>>> forging ahead.
>>>>>
>>>>> yep, I think this is documented on the Wiki. It is partially why I
>>>>> suggested that we deliver the content of runtime/local as our binary
>>>>> release for next time. Most people use Nutch in local mode so this
>>>>> would make their lives easier, as for the advanced users (read pseudo
>>>>> or real distributed) they need to recompile the job file anyway and
>>>>> I'd expect them to use the src release
>>>>
>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>
>>>> In the meanwhile, time to figure out why I still can't get it to crawl
>>>> the PDFs... :(
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: [email protected]
>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> --
> Markus Jelsma - CTO - Openindex
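
To make the "simple Java driver" idea above a bit more concrete, here is a rough sketch of what such a driver might look like. It is only an illustration under stated assumptions: `SimpleCrawlDriver` and `Step` are hypothetical names, not existing Nutch classes, and each `Step` merely stands in for one of the underlying Inject, Generate, and Fetch tools. How those tools would actually be invoked, and whether further steps belong in the loop, is left open here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a thin crawl driver; none of these names are Nutch APIs. */
public class SimpleCrawlDriver {

    /** Stand-in for one underlying tool (Inject, Generate or Fetch). */
    public interface Step {
        /** Runs the tool against the crawl directory; returns 0 on success. */
        int run(String crawlDir);
    }

    private final Step inject;
    private final Step generate;
    private final Step fetch;

    public SimpleCrawlDriver(Step inject, Step generate, Step fetch) {
        this.inject = inject;
        this.generate = generate;
        this.fetch = fetch;
    }

    /**
     * Injects the seeds once, then runs generate followed by fetch for each
     * round. Stops at the first non-zero return code and passes it back.
     */
    public int crawl(String crawlDir, int rounds) {
        int rc = inject.run(crawlDir);
        if (rc != 0) {
            return rc;
        }
        for (int i = 0; i < rounds; i++) {
            rc = generate.run(crawlDir);
            if (rc != 0) {
                return rc;
            }
            rc = fetch.run(crawlDir);
            if (rc != 0) {
                return rc;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Dummy steps that only record the order in which they were called.
        List<String> calls = new ArrayList<>();
        SimpleCrawlDriver driver = new SimpleCrawlDriver(
                dir -> { calls.add("inject"); return 0; },
                dir -> { calls.add("generate"); return 0; },
                dir -> { calls.add("fetch"); return 0; });
        int rc = driver.crawl("crawl", 2);
        System.out.println(calls + " rc=" + rc);
    }
}
```

The point of keeping the driver this thin is that it would carry no configuration of its own (no separate crawl-urlfilter.txt, for instance), so running it should behave the same as running the individual commands by hand.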

