> Hey Markus,
> 
> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> > I think Marek is right, the crawl-filter _is_ used in the crawl command.
> > I don't know what happens if it isn't there.
> 
> Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf
> directory in runtime/local/conf from 1.4?

It's gone! I checked and last saw it in 1.2. Strange.

> 
> > Good reasons to get rid of the crawl command and stuff in 1.5 if you ask
> > me.
> 
> I'd be in favor of replacing the current Crawl command with a simple Java
> driver that just calls the underlying Inject, Generate, and Fetch tools.
> Would that work?

There's an open issue to replace it with a basic crawl shell script. It's 
easier to understand and uses the same commands. Non-Java users should be able 
to deal with it better, and provide us with better problem descriptions.
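
For illustration, such a script might look roughly like the sketch below. It
is not the script from the open issue, only a minimal outline assuming the
usual bin/nutch sub-commands (inject, generate, fetch, parse, updatedb). The
NUTCH variable, the function names, and the argument order are made up for
this example, and link inversion and indexing are left out.

```shell
#!/bin/sh
# Sketch only: one inject step, then N generate/fetch/parse/updatedb rounds.
# NUTCH, latest_segment, and crawl are illustrative names, not part of Nutch.

NUTCH="${NUTCH:-bin/nutch}"

# Print the newest segment directory under crawl dir $1.
# Segment names are timestamps, so a plain sort puts the newest last.
latest_segment() {
  ls -d "$1"/segments/* 2>/dev/null | sort | tail -n 1
}

# crawl <crawl_dir> <seed_dir> <rounds> <top_n>
crawl() {
  crawl_dir="$1"; seed_dir="$2"; rounds="$3"; top_n="$4"

  # Seed the crawldb with the URLs found in the seed directory.
  $NUTCH inject "$crawl_dir/crawldb" "$seed_dir" || return 1

  i=1
  while [ "$i" -le "$rounds" ]; do
    # Select up to top_n URLs and write them to a new segment.
    $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$top_n" || return 1
    segment=$(latest_segment "$crawl_dir")
    [ -n "$segment" ] || { echo "no segment generated" >&2; return 1; }

    # Fetch and parse the segment, then fold the results back into the crawldb.
    $NUTCH fetch "$segment" || return 1
    $NUTCH parse "$segment" || return 1
    $NUTCH updatedb "$crawl_dir/crawldb" "$segment" || return 1
    i=$((i + 1))
  done
}

# Example call (paths are placeholders): crawl crawl urls 3 50
```

Because every step is a plain command line, a failing step can be quoted
verbatim in a problem report.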

> 
> Cheers,
> Chris
> 
> > On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> >> Hi Marek,
> >> 
> >> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> >>> I think when you use the crawl command instead of the single commands,
> >>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
> >>> But I don't know if it is still the case in 1.4
> >>> 
> >>> Could that be the problem?
> >> 
> >> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
> >> it looks like urlfilter-regex is the one that's enabled by default
> >> and shipped with the basic config.
> >> 
> >> Thanks for trying to help though. I'm going to figure this out! Or,
> >> someone is going to probably tell me what I'm doing wrong.
> >> We'll see what happens first :-)
> >> 
> >> Cheers,
> >> Chris
> >> 
> >>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
> >>>>>> figured out how to make it work in 1.4 (instead of editing the
> >>>>>> global, top-level conf/nutch-default.xml,
> >>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
> >>>>>> forging ahead.
> >>>>> 
> >>>>> yep, I think this is documented on the Wiki. It is partially why I
> >>>>> suggested that we deliver the content of runtime/local as our binary
> >>>>> release for next time. Most people use Nutch in local mode so this
> >>>>> would make their lives easier, as for the advanced users (read pseudo
> >>>>> or real distributed) they need to recompile the job file anyway and
> >>>>> I'd expect them to use the src release
> >>>> 
> >>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
> >>>> 
> >>>> In the meanwhile, time to figure out why I still can't get it to crawl
> >>>> the PDFs... :(
> >>>> 
> >>>> Cheers,
> >>>> Chris
> >>>> 
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Senior Computer Scientist
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 171-266B, Mailstop: 171-246
> >>>> Email: [email protected]
> >>>> WWW:   http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Assistant Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++