Re: Can't get Nutch to crawl PDFs

Mattmann, Chris A (388J) Thu, 24 Nov 2011 09:30:10 -0800

Hey Lewis,

Thanks for the offer to help. To summarize, here's what I'm dealing with.


I'm trying to download all the PDFs from vault.fbi.gov. It's a Plone-based CMS
site, so the PDF URLs don't actually end in .pdf. Take a look at, e.g.:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

Notice that on that page, there is a link to:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file

That's the actual link to the PDF file. The view page pulls up some browser-based
PDF viewer, but I just want the at_download/file link to get picked up. For whatever
reason, it's not being picked up in either Nutch 1.3 or Nutch 1.4, using the most
basic configuration, where the only change was to regex-urlfilter.txt: I commented
out everything and added:

+^http://([a-z0-9]*\.)*vault.fbi.gov/
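
To rule the URL filter in or out, the rule can be checked against both URLs outside of Nutch. This is only a minimal Python sketch, assuming (as I understand urlfilter-regex) that rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The `accepts` helper is made up for illustration and is not Nutch code:

```python
import re

# Rule exactly as written in regex-urlfilter.txt: leading '+' accepts, '-' rejects.
RULES = [r"+^http://([a-z0-9]*\.)*vault.fbi.gov/"]

def accepts(url, rules=RULES):
    """Hypothetical stand-in for the filter: first matching rule wins, no match rejects."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

view = "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view"
pdf = "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file"

for url in (view, pdf):
    print(accepts(url), url)
```

Under those assumptions both URLs are accepted, so the filter rule by itself would not explain the missing at_download link. (The unescaped dots in vault.fbi.gov match any character, which is looser than intended but harmless here.)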

I've tried crawling using the crawl tool, and I've tried separate inject, generate,
and fetch cycles. Either way, Nutch won't pick up the danged at_download URLs on
these pages. While searching through the docs, I ran across this:

http://s.apache.org/wIC

I also noticed that each one of those Vault pages has 150+ outlinks on it. So I
set the property in nutch-default.xml that limits the outlinks per page to -1. I
also set http.content.limit to -1. So I think both of those properties are set fine.
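
Why these two limits matter can be shown with a toy model. This is a sketch in Python, not Nutch's parser: the `outlinks` function, the made-up page, and the numbers 100 and 1000 are all assumptions chosen for illustration, with a negative value meaning "no limit" as with the -1 settings above:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def outlinks(html, content_limit=-1, max_outlinks=-1):
    """Hypothetical model: truncate the body, parse it, then cap the link list."""
    if content_limit >= 0:
        html = html[:content_limit]
    parser = LinkCollector()
    parser.feed(html)
    links = parser.links
    return links if max_outlinks < 0 else links[:max_outlinks]

# Made-up page: 150 navigation links, then the download link near the end.
page = "".join('<a href="/nav/%d">n</a>' % i for i in range(150))
page += '<a href="/doc/at_download/file">PDF</a>'
target = "/doc/at_download/file"

print(target in outlinks(page, max_outlinks=100))    # link cap applied
print(target in outlinks(page, content_limit=1000))  # body truncated
print(target in outlinks(page))                      # both limits off
```

In this model the download link survives only when both limits are off, which is why I set both to -1. If the link still goes missing with both at -1, the cause is presumably somewhere else.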

I just can't get it to crawl the PDF files :(

Help is welcome. :-)

Cheers,
Chris


On Nov 24, 2011, at 9:10 AM, Lewis John Mcgibbney wrote:

> Hey Chris,
> 
> Obviously I've read your thread and I would like to try and help here if I
> can.
> 
> Can you sum up in a sentence or two what you think is happening, and what
> you would like to happen?
> 
> Is the issue simply that Nutch is not fetching/parsing certain PDFs?
> 
> On Thu, Nov 24, 2011 at 3:59 PM, Mattmann, Chris A (388J) <
> [email protected]> wrote:
> 
>> Hi Marek,
>> 
>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>> 
>>> I think when you use the crawl command instead of the single commands,
>>> you have to specify the regEx rules in the crawl-urlfilter.txt file.
>>> But I don't know if it is still the case in 1.4
>>> 
>>> Could that be the problem?
>> 
>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also
>> it looks like urlfilter-regex is the one that's enabled by default
>> and shipped with the basic config.
>> 
>> Thanks for trying to help though. I'm going to figure this out! Or,
>> someone is going to probably tell me what I'm doing wrong.
>> We'll see what happens first :-)
>> 
>> Cheers,
>> Chris
>> 
>>> 
>>> 
>>> 
>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>> 
>>>>>> 
>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>> global, top-level conf/nutch-default.xml, I needed to edit
>>>>>> runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>>>>> 
>>>>> 
>>>>> yep, I think this is documented on the Wiki. It is partially why I
>>>>> suggested that we deliver the content of runtime/local as our binary
>>>>> release for next time. Most people use Nutch in local mode so this
>>>>> would make their lives easier, as for the advanced users (read pseudo
>>>>> or real distributed) they need to recompile the job file anyway and
>>>>> I'd expect them to use the src release
>>>> 
>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>> 
>>>> In the meantime, time to figure out why I still can't get it to crawl
>>>> the PDFs... :(
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: [email protected]
>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 
>>> 
>> 
> 
> 
> -- 
> *Lewis*


