Hey Lewis,

Thanks for the offer to help. To summarize, here's what I'm dealing with.

I'm trying to download all the PDFs from vault.fbi.gov. It's a Plone-based CMS site, so the PDF URLs don't actually end in .pdf. Take a look at, e.g.:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

Notice that on that page, there is a link to:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file

That's the actual link to the PDF file. The view page pulls up some browser-based PDF viewer, but I just want the at_download/file link to get picked up.

For whatever reason, it's not being picked up for me in either Nutch 1.3 or Nutch 1.4, using the most basic configuration, where I only changed regex-urlfilter.txt to comment out everything and to include:

+^http://([a-z0-9]*\.)*vault.fbi.gov/

I've tried crawling using the crawl tool, and I've tried separate inject, generate, and fetch cycles. Either way, Nutch won't pick up the danged at_download URLs on these pages.

While searching through the docs, I ran across this:

http://s.apache.org/wIC

I also noticed that each one of those Vault pages has 150+ outlinks on it. So I set the property in nutch-default.xml that limits the outlinks to -1. I also set http.content.limit to -1. So I think both of those properties are set fine.

I just can't get it to crawl the PDF files :(

Help is welcome. :-)

Cheers,
Chris

On Nov 24, 2011, at 9:10 AM, Lewis John Mcgibbney wrote:

> Hey Chris,
>
> Obviously I've read your thread and I would like to try and help here if I
> can.
>
> Can you sum up in a sentence or two what you think is happening, what you
> would like to happen?
>
> Is the issue simply that Nutch is not fetching/parsing certain PDF's?
>
> On Thu, Nov 24, 2011 at 3:59 PM, Mattmann, Chris A (388J) <
> [email protected]> wrote:
>
>> Hi Marek,
>>
>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>
>>> I think when you use the crawl command instead of the single commands,
>>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
>>> But I don't know if it is still the case in 1.4
>>>
>>> Could that be the problem?
>>
>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
>> it looks like urlfilter-regex is the one that's enabled by default
>> and shipped with the basic config.
>>
>> Thanks for trying to help though. I'm going to figure this out! Or,
>> someone is going to probably tell me what I'm doing wrong.
>> We'll see what happens first :-)
>>
>> Cheers,
>> Chris
>>
>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>
>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>>>>>> how to make it work in 1.4 (instead of editing the global, top-level
>>>>>> conf/nutch-default.xml,
>>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is
>>>>>> forging ahead.
>>>>>
>>>>> yep, I think this is documented on the Wiki. It is partially why I
>>>>> suggested that we deliver the content of runtime/local as our binary
>>>>> release for next time. Most people use Nutch in local mode so this would
>>>>> make their lives easier, as for the advanced users (read pseudo or real
>>>>> distributed) they need to recompile the job file anyway and I'd expect them
>>>>> to use the src release
>>>>
>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>
>>>> In the meanwhile, time to figure out why I still can't get it to crawl
>>>> the PDFs... :(
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: [email protected]
>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> --
> *Lewis*
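P.S. One quick sanity check on the filter side: the sketch below takes the one rule from regex-urlfilter.txt and tests it against the two Vault URLs mentioned above. It is only a rough stand-in written in Python, not Nutch's own code. The helper name `accepts` is made up for illustration, and it assumes the filter does an unanchored search on each URL (like Java's Matcher.find()) and that Python's `re` treats this particular pattern the same way Java's regex engine would.

```python
import re

# Rule copied verbatim from regex-urlfilter.txt; the leading '+' means "accept".
RULE = r"+^http://([a-z0-9]*\.)*vault.fbi.gov/"


def accepts(rule: str, url: str) -> bool:
    """Return True if a single '+' rule would accept the URL.

    Hypothetical helper, not part of Nutch. Assumption: the filter runs an
    unanchored search, so only the '^' inside the pattern anchors the match.
    A rule that does not start with '+' never accepts here.
    """
    sign, pattern = rule[0], rule[1:]
    matched = re.search(pattern, url) is not None
    return matched if sign == "+" else False


URLS = [
    "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view",
    "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file",
]

# Evaluate the rule against both the viewer page and the download link.
results = {url: accepts(RULE, url) for url in URLS}
for url, ok in results.items():
    print("ACCEPT" if ok else "REJECT", url)
```

If both URLs come back as accepted under those assumptions, the rule itself is probably not what drops the at_download link, and the problem more likely sits in outlink extraction or in which copy of the config is actually being read.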

