I think when you use the crawl command instead of the individual commands, you have to specify the regex rules in the crawl-urlfilter.txt file rather than in regex-urlfilter.txt. But I don't know whether that is still the case in 1.4.
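For reference, a minimal crawl-urlfilter.txt might look like the sketch below. The seed domain (example.com) is a placeholder; the suffix list and patterns are illustrative, not the stock defaults, and should be adapted to your site. Note that if a suffix filter line matches .pdf, PDFs will never be fetched in the first place.

```
# Sketch of crawl-urlfilter.txt rules (hypothetical domain; adjust to your crawl).
# Skip URLs with common non-content file suffixes (make sure pdf is NOT listed
# here if you want PDFs fetched):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|js|zip|gz|tgz|exe|bmp|BMP)$
# Skip URLs containing characters that usually indicate queries or session ids:
-[?*!@=]
# Accept everything under the seed host:
+^http://([a-z0-9]*\.)*example.com/
# Reject everything else:
-.
```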
Could that be the problem?

On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>
>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>>> how to make it work in 1.4 (instead of editing the global, top-level
>>> conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml).
>>> Crawling is forging ahead.
>>>
>>
>> yep, I think this is documented on the Wiki. It is partially why I
>> suggested that we deliver the content of runtime/local as our binary
>> release for next time. Most people use Nutch in local mode, so this would
>> make their lives easier; as for the advanced users (read: pseudo- or real
>> distributed), they need to recompile the job file anyway and I'd expect them
>> to use the src release.
>
> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>
> In the meanwhile, time to figure out why I still can't get it to crawl
> the PDFs... :(
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
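On the PDF problem mentioned in the quoted thread: besides the URL filters, another common cause is that the Tika parser plugin is not enabled. A sketch of the relevant nutch-site.xml override follows; the property name plugin.includes is standard Nutch, but the exact plugin list is an assumption and should be checked against your install's nutch-default.xml.

```
<?xml version="1.0"?>
<!-- runtime/local/conf/nutch-site.xml (sketch; merge with your existing overrides) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Include parse-tika so binary formats such as PDF get parsed.</description>
  </property>
</configuration>
```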

