I think when you use the crawl command instead of the individual commands, you have to specify the regex rules in the crawl-urlfilter.txt file rather than in regex-urlfilter.txt. But I don't know whether that is still the case in 1.4.
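For reference, a minimal crawl-urlfilter.txt might look like the sketch below. The seed domain (example.com) is a placeholder; the suffix list and patterns are illustrative, not the stock defaults, and should be adapted to your site. Note that if a suffix filter line matches .pdf, PDFs will never be fetched in the first place.

```
# Sketch of crawl-urlfilter.txt rules (hypothetical domain; adjust to your crawl).
# Skip URLs with common non-content file suffixes (make sure pdf is NOT listed
# here if you want PDFs fetched):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|js|zip|gz|tgz|exe|bmp|BMP)$
# Skip URLs containing characters that usually indicate queries or session ids:
-[?*!@=]
# Accept everything under the seed host:
+^http://([a-z0-9]*\.)*example.com/
# Reject everything else:
-.
```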
Could that be the problem?

On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>
>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>>> how to make it work in 1.4 (instead of editing the global, top-level
>>> conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml).
>>> Crawling is forging ahead.
>>>
>>
>> yep, I think this is documented on the Wiki. It is partially why I
>> suggested that we deliver the content of runtime/local as our binary
>> release for next time. Most people use Nutch in local mode, so this would
>> make their lives easier; as for the advanced users (read: pseudo- or real
>> distributed), they need to recompile the job file anyway and I'd expect them
>> to use the src release.
>
> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>
> In the meanwhile, time to figure out why I still can't get it to crawl
> the PDFs... :(
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
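On the PDF problem mentioned in the quoted thread: besides the URL filters, another common cause is that the Tika parser plugin is not enabled. A sketch of the relevant nutch-site.xml override follows; the property name plugin.includes is standard Nutch, but the exact plugin list is an assumption and should be checked against your install's nutch-default.xml.

```
<?xml version="1.0"?>
<!-- runtime/local/conf/nutch-site.xml (sketch; merge with your existing overrides) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Include parse-tika so binary formats such as PDF get parsed.</description>
  </property>
</configuration>
```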

