Hey Markus,

On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:

> > Hey Markus,
> > 
> > On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> > > I think Marek is right, the crawl-urlfilter _is_ used in the crawl command.
> > > I don't know what happens if it isn't there.
> > 
> > Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf
> > directory in runtime/local/conf from 1.4?
> It's gone! I checked and last saw it in 1.2. Strange.
> > 
> > > Good reasons to get rid of the crawl command and stuff in 1.5 if you ask
> > > me.
> > 
> > I'd be in favor of replacing the current Crawl command with a simple Java
> > driver that just calls the underlying Inject, Generate, and Fetch tools.
> > Would that work?
> There's an open issue to replace it with a basic crawl shell script. It's 
> easier to understand and uses the same commands. Non-Java users should be 
> able to deal with it better, and provide us with better problem descriptions.

+1, that would be cool indeed. Do you know what issue it is?
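
For what it's worth, here is roughly what I picture such a driver doing, as a quick Python sketch rather than real Nutch code. It only assembles the bin/nutch command lines for one crawl round. Assumptions on my part: bin/nutch is run from runtime/local, the crawldb, segments, and urls paths are placeholders, the helper names are made up, and I've added parse and updatedb after fetch because a round normally needs them.

```python
import os
import subprocess
from typing import List

NUTCH = "bin/nutch"  # assumed: launcher script, run from runtime/local


def inject_command(crawldb: str, url_dir: str) -> List[str]:
    """Command that seeds the crawldb from a directory of URL lists."""
    return [NUTCH, "inject", crawldb, url_dir]


def generate_command(crawldb: str, segments_dir: str, top_n: int) -> List[str]:
    """Command that writes a new fetch list (segment) under segments_dir."""
    return [NUTCH, "generate", crawldb, segments_dir, "-topN", str(top_n)]


def segment_commands(crawldb: str, segment: str) -> List[List[str]]:
    """Commands applied to one generated segment, in order."""
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],
    ]


def crawl(crawldb: str, segments_dir: str, url_dir: str,
          rounds: int = 1, top_n: int = 50) -> None:
    """Run inject once, then `rounds` generate/fetch/parse/updatedb cycles."""
    subprocess.run(inject_command(crawldb, url_dir), check=True)
    for _ in range(rounds):
        subprocess.run(generate_command(crawldb, segments_dir, top_n), check=True)
        # Assumption: segment directories are named by timestamp, so the
        # lexicographically last entry is the one that was just generated.
        newest = os.path.join(segments_dir, sorted(os.listdir(segments_dir))[-1])
        for command in segment_commands(crawldb, newest):
            subprocess.run(command, check=True)
```

Only crawl() runs anything or touches the file system; the other helpers just build lists, so the commands can be printed and checked before starting a crawl.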

BTW, I'm currently instrumenting urlfilter-regex to see if I can figure out 
if it's dropping the at_download URLs for whatever reason. Sigh.
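
The check I have in mind looks something like this standalone Python sketch. It is not the plugin code: it just mimics how I understand the regex filter to behave (each rule is + or - followed by a regex, the first rule that matches decides, and a URL matching no rule is rejected). The rules and the example.com URLs are made up for illustration, the function names are mine, and Python's regex dialect is not identical to Java's, so treat the result as a hint only.

```python
import re
from typing import List, Optional, Tuple

Rule = Tuple[bool, re.Pattern]  # (accept?, compiled regex)

# Hypothetical rule lines in the "+regex" / "-regex" style.
RULES = [
    "# skip URLs containing query-like characters",
    "-[?*!@=]",
    "# skip a few binary suffixes",
    r"-\.(gif|jpg|png|zip)$",
    "# accept anything else",
    "+.",
]


def parse_rules(lines: List[str]) -> List[Rule]:
    """Turn rule lines into (accept, pattern) pairs, skipping comments and blanks."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def explain(url: str, rules: List[Rule]) -> Tuple[bool, Optional[str]]:
    """Return (accepted, deciding regex); the first matching rule wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept, pattern.pattern
    return False, None  # no rule matched, so the URL is dropped
```

Calling explain() on each at_download URL from the fetch list reports which rule, if any, is responsible for dropping it.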

Cheers,
Chris

> > 
> > Cheers,
> > Chris
> > 
> > > On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> > >> Hi Marek,
> > >> 
> > >> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> > >>> I think when you use the crawl command instead of the single commands,
> > >>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
> > >>> But I don't know if it is still the case in 1.4
> > >>> 
> > >>> Could that be the problem?
> > >> 
> > >> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
> > >> it looks like urlfilter-regex is the one that's enabled by default
> > >> and shipped with the basic config.
> > >> 
> > >> Thanks for trying to help though. I'm going to figure this out! Or,
> > >> someone is going to probably tell me what I'm doing wrong.
> > >> We'll see what happens first :-)
> > >> 
> > >> Cheers,
> > >> Chris
> > >> 
> > >>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> > >>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> > >>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
> > >>>>>> figured out how to make it work in 1.4 (instead of editing the
> > >>>>>> global, top-level conf/nutch-default.xml,
> > >>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
> > >>>>>> forging ahead.
> > >>>>> 
> > >>>>> yep, I think this is documented on the Wiki. It is partially why I
> > >>>>> suggested that we deliver the content of runtime/local as our binary
> > >>>>> release for next time. Most people use Nutch in local mode, so this
> > >>>>> would make their lives easier. As for the advanced users (read: pseudo
> > >>>>> or real distributed), they need to recompile the job file anyway, and
> > >>>>> I'd expect them to use the src release.
> > >>>> 
> > >>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
> > >>>> 
> > >>>> In the meantime, time to figure out why I still can't get it to crawl
> > >>>> the PDFs... :(
> > >>>> 
> > >>>> Cheers,
> > >>>> Chris
> > >>>> 
> > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>>> Chris Mattmann, Ph.D.
> > >>>> Senior Computer Scientist
> > >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > >>>> Office: 171-266B, Mailstop: 171-246
> > >>>> Email: [email protected]
> > >>>> WWW:   http://sunset.usc.edu/~mattmann/
> > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>>> Adjunct Assistant Professor, Computer Science Department
> > >>>> University of Southern California, Los Angeles, CA 90089 USA
> > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >> 
> > 


