Hey Chris,

I've read your thread and would like to help here if I can.

Can you sum up in a sentence or two what you think is happening and what
you would like to happen?

Is the issue simply that Nutch is not fetching/parsing certain PDFs?
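
In case it helps rule out the URL filter, here is a rough sketch of how
first-match rules in the style of the regex URL filter file decide on a URL.
It is plain Python, not Nutch code. The rule list, the example URLs and the
function names are all made up for illustration, and the "reject if nothing
matches" fallback is my assumption about how the filter behaves, so please
check your own rules file in runtime/local/conf rather than trusting this.

```python
import re

# Hypothetical rules in the style of a regex URL filter file:
# a leading '+' accepts, a leading '-' rejects, '#' starts a comment.
RULES = r"""
# skip a few binary suffixes (illustrative list; note pdf is not in it)
-\.(gif|jpg|png|zip|exe)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides the outcome."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    # Assumption: a URL that matches no rule at all is rejected.
    return False


rules = parse_rules(RULES)
for url in ("http://example.com/report.pdf", "http://example.com/logo.png"):
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

If a PDF URL is rejected by a check like this against your real rules, the
filter is the culprit; if it is accepted, the problem is more likely
somewhere in fetching or parsing.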

On Thu, Nov 24, 2011 at 3:59 PM, Mattmann, Chris A (388J) <
[email protected]> wrote:

> Hi Marek,
>
> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>
> > I think when you use the crawl command instead of the single commands,
> > you have to specify the regEx rules in the crawl-urlfilter.txt file.
> > But I don't know if it is still the case in 1.4
> >
> > Could that be the problem?
>
> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
> it looks like urlfilter-regex is the one that's enabled by default
> and shipped with the basic config.
>
> Thanks for trying to help though. I'm going to figure this out! Or,
> someone is going to probably tell me what I'm doing wrong.
> We'll see what happens first :-)
>
> Cheers,
> Chris
>
> >
> >
> >
> > On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>
> >>>>
> >>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
> >>>> how to make it work in 1.4 (instead of editing the global, top-level
> >>>> conf/nutch-default.xml,
> >>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
> >>>> forging ahead.
> >>>>
> >>>
> >>> yep, I think this is documented on the Wiki. It is partially why I
> >>> suggested that we deliver the content of runtime/local as our binary
> >>> release for next time. Most people use Nutch in local mode so this would
> >>> make their lives easier, as for the advanced users (read pseudo or real
> >>> distributed) they need to recompile the job file anyway and I'd expect them
> >>> to use the src release
> >>
> >> +1, I'll be happy to edit build.xml and make that happen for 1.5.
> >>
> >> In the meanwhile, time to figure out why I still can't get it to crawl
> >> the PDFs... :(
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: [email protected]
> >> WWW:   http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >
>
>
>


-- 
*Lewis*
