Hey Markus,

On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:

> I think Marek is right, the crawl-filter _is_ used in the crawl command. I 
> don't know what happens if it isn't there.

Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf
directory (runtime/local/conf) from 1.4.

> 
> Good reasons to get rid of the crawl command and stuff in 1.5 if you ask me.

I'd be in favor of replacing the current Crawl command with a simple Java 
driver that just calls the underlying Inject, Generate, and Fetch tools. Would 
that work?
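
To make the idea concrete, here is a rough sketch of the kind of driver I mean. The Step interface and the step() stubs below are hypothetical placeholders, not actual Nutch classes; a real driver would wrap the existing Inject, Generate, and Fetch tools rather than these stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrawlDriver {

    /** Hypothetical stand-in for one tool invocation; not a Nutch interface. */
    interface Step {
        String name();
        int run(String[] args) throws Exception; // 0 means success
    }

    /** Runs the steps in order and stops at the first non-zero exit code. */
    static List<String> runAll(List<Step> steps, String[] args) throws Exception {
        List<String> completed = new ArrayList<String>();
        for (Step s : steps) {
            int rc = s.run(args);
            if (rc != 0) {
                throw new IllegalStateException(s.name() + " failed with exit code " + rc);
            }
            completed.add(s.name());
        }
        return completed;
    }

    /** Stub step that just returns a fixed exit code (illustration only). */
    static Step step(final String name, final int exitCode) {
        return new Step() {
            public String name() { return name; }
            public int run(String[] args) { return exitCode; }
        };
    }

    public static void main(String[] args) throws Exception {
        List<Step> steps = Arrays.asList(
            step("inject", 0), step("generate", 0), step("fetch", 0));
        System.out.println("Completed: " + runAll(steps, args));
    }
}
```

The point is only that the driver holds no filtering or config logic of its own; each tool keeps reading whatever configuration it reads when run standalone.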

Cheers,
Chris

> 
> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>> Hi Marek,
>> 
>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>> I think when you use the crawl command instead of the single commands,
>>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
>>> But I don't know if it is still the case in 1.4
>>> 
>>> Could that be the problem?
>> 
>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
>> it looks like urlfilter-regex is the one that's enabled by default
>> and shipped with the basic config.
>> 
>> Thanks for trying to help though. I'm going to figure this out! Or,
>> someone is going to probably tell me what I'm doing wrong.
>> We'll see what happens first :-)
>> 
>> Cheers,
>> Chris
>> 
>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>> global, top-level conf/nutch-default.xml,
>>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
>>>>>> forging ahead.
>>>>> 
>>>>> yep, I think this is documented on the Wiki. It is partially why I
>>>>> suggested that we deliver the content of runtime/local as our binary
>>>>> release for next time. Most people use Nutch in local mode so this
>>>>> would make their lives easier, as for the advanced users (read pseudo
>>>>> or real distributed) they need to recompile the job file anyway and
>>>>> I'd expect them to use the src release
>>>> 
>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>> 
>>>> In the meanwhile, time to figure out why I still can't get it to crawl
>>>> the PDFs... :(
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: [email protected]
>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
> 
> -- 
> Markus Jelsma - CTO - Openindex


