Hey Chris,

I've read your thread and would like to help here if I can.
Can you sum up in a sentence or two what you think is happening and what you would like to happen? Is the issue simply that Nutch is not fetching/parsing certain PDFs?

On Thu, Nov 24, 2011 at 3:59 PM, Mattmann, Chris A (388J) <[email protected]> wrote:

> Hi Marek,
>
> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>
> > I think when you use the crawl command instead of the single commands,
> > you have to specify the regEx rules in the crawl-urlfilter.txt file.
> > But I don't know if it is still the case in 1.4
> >
> > Could that be the problem?
>
> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
> it looks like urlfilter-regex is the one that's enabled by default
> and shipped with the basic config.
>
> Thanks for trying to help though. I'm going to figure this out! Or,
> someone is going to probably tell me what I'm doing wrong.
> We'll see what happens first :-)
>
> Cheers,
> Chris
>
> >
> > On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>
> >>>>
> >>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
> >>>> how to make it work in 1.4 (instead of editing the global, top-level
> >>>> conf/nutch-default.xml,
> >>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
> >>>> forging ahead.
> >>>>
> >>>
> >>> yep, I think this is documented on the Wiki. It is partially why I
> >>> suggested that we deliver the content of runtime/local as our binary
> >>> release for next time. Most people use Nutch in local mode so this would
> >>> make their lives easier, as for the advanced users (read pseudo or real
> >>> distributed) they need to recompile the job file anyway and I'd expect them
> >>> to use the src release
> >>
> >> +1, I'll be happy to edit build.xml and make that happen for 1.5.
> >>
> >> In the meanwhile, time to figure out why I still can't get it to crawl
> >> the PDFs... :(
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: [email protected]
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >
>

--
*Lewis*

