Hi Marek,

On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:

> I think when you use the crawl command instead of the single commands,
> you have to specify the regEx rules in the crawl-urlfilter.txt file.
> But I don't know if it is still the case in 1.4
> 
> Could that be the problem?

There doesn't seem to be a crawl-urlfilter.txt in the conf dir. Also,
it looks like urlfilter-regex is the filter plugin that's enabled by
default and shipped with the basic config.
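
One way to rule the URL filter in or out as the reason the PDFs are skipped
is to replay the rules outside Nutch. Below is a minimal sketch, assuming the
regex filter file holds one rule per line, each a '+' (accept) or '-' (reject)
followed by a regular expression, that the first matching rule wins, and that
a URL matching no rule is ignored. The rule text and URLs are hypothetical
examples, not copied from the shipped config, and parse_rules/accepts are
illustrative helpers, not Nutch APIs.

```python
import re

# Hypothetical rules in the style of a regex URL filter file.
RULES = r"""
# skip a few image and archive suffixes (illustrative list, no pdf here)
-\.(gif|jpg|png|css|zip|gz)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first matching rule accepts the URL.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):  # partial match anywhere in the URL
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://example.com/report.pdf", "http://example.com/logo.gif"):
        print(url, "accepted" if accepts(url, rules) else "rejected")
```

If a PDF URL comes back rejected with the real rules pasted in, the filter is
the culprit; if it is accepted, the problem is more likely elsewhere (for
example in parsing or content limits).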

Thanks for trying to help, though. I'm going to figure this out! Or
someone will probably tell me what I'm doing wrong.
We'll see which happens first :-)

Cheers,
Chris

> 
> 
> 
> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>> 
>>>> 
>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>>>> how to make it work in 1.4 (instead of editing the global, top-level
>>>> conf/nutch-default.xml,
>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
>>>> forging ahead.
>>>> 
>>> 
>>> Yep, I think this is documented on the Wiki. It is partly why I
>>> suggested that we deliver the content of runtime/local as our binary
>>> release next time. Most people use Nutch in local mode, so this would
>>> make their lives easier. As for the advanced users (read: pseudo or real
>>> distributed), they need to recompile the job file anyway, and I'd expect
>>> them to use the src release.
>> 
>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>> 
>> In the meantime, time to figure out why I still can't get it to crawl
>> the PDFs... :(
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
> 


