OK, it didn't work again. Here are the URLs from a full crawl cycle: http://pastebin.com/Jx3Ar6Md
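(That list is from the usual merge-and-dump route; roughly, modulo my local
paths:

  bin/nutch mergesegs crawl/merged -dir crawl/segments
  bin/nutch readseg -dump crawl/merged/* dump
  grep 'http://vault.fbi.gov' dump/dump | sort -u

so it should be every vault.fbi.gov URL the cycle saw.)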
When I run it independently and seed it with an *at_download* URL, pointing
directly at the PDF, it parses the PDF. But when I run the normal crawl with
topN 10 and depth 10, it never picks them up. /me stumped. I'll poke around in
the code, but was just wondering if I was doing something wrong.

Cheers,
Chris

On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:

> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
> how to make it work in 1.4: instead of editing the global, top-level
> conf/nutch-default.xml, I needed to edit
> runtime/local/conf/nutch-default.xml. Crawling is forging ahead.
>
> I'll report back on whether I'm able to grab the PDFs or not, using 1.4...
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>
>> *really* weird.
>>
>> With 1.4, even though I have my http.agent.name property set in
>> conf/nutch-default.xml, it keeps telling me this when I try to crawl:
>>
>> Fetcher: No agents listed in 'http.agent.name' property.
>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No
>> agents listed in 'http.agent.name' property.
>>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>> [chipotle:local/nutch/framework] mattmann%
>>
>> Is nutch-default.xml not read by the crawl command in 1.4?
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>
>>> Can you also try with trunk or 1.4? I get different output with
>>> parsechecker, such as a proper title.
>>>
>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch
>>> parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> contentType: application/pdf
>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>> ---------
>>> Url
>>> ---------------
>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>> ---------
>>> ParseData
>>> ---------
>>> Version: 5
>>> Status: success(1,0)
>>> Title: Watergate Summary Part 02 of 02
>>> Outlinks: 0
>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493
>>> Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment;
>>> filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
>>> Connection=close Accept-Ranges=bytes Content-Type=application/pdf
>>> Server=HTML Cache-Control=max-age=604800
>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z
>>> created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter
>>> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT
>>> Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf
>>> creator=FBI
>>>
>>>> Hey Markus,
>>>>
>>>> I set http.content.limit to -1, so it shouldn't have a limit.
>>>>
>>>> I'll try injecting that single URL and see if I can get it to download
>>>> using separate commands, and see what happens! :-)
>>>>
>>>> Cheers,
>>>> Chris
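(For the archive, that separate-commands cycle is roughly the following, with
my paths and a topN of 10; the segment dir is just whatever generate last
created:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 10
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s

Seeded with the at_download URL this way, the PDF fetches and parses fine.)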
>>>>
>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>>
>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>>>> Can you also check without merging segments? Or, as a last resort,
>>>>> inject that single URL into an empty crawl db and do a single crawl
>>>>> cycle, preferably using separate commands instead of the crawl command?
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>>
>>>>>> http://vault.fbi.gov/
>>>>>>
>>>>>> My regex-urlfilter.txt diff is:
>>>>>>
>>>>>> # accept anything else
>>>>>> #+.
>>>>>>
>>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>>
>>>>>> I'm trying to get it to parse PDFs like:
>>>>>>
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>
>>>>>> I see that my config lets ParserChecker parse it OK:
>>>>>>
>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch
>>>>>> org.apache.nutch.parse.ParserChecker
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> contentType: application/pdf
>>>>>> ---------
>>>>>> Url
>>>>>> ---------------
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> ---------
>>>>>> ParseData
>>>>>> ---------
>>>>>> Version: 5
>>>>>> Status: success(1,0)
>>>>>> Title:
>>>>>> Outlinks: 0
>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT
>>>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT
>>>>>> Content-Disposition=attachment; filename="watergat2.pdf"
>>>>>> Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close
>>>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
>>>>>> Cache-Control=max-age=604800
>>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>>
>>>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in
>>>>>> terms of plugin.includes, as it looks like parse-tika is included and
>>>>>> handles the * contentType.
>>>>>>
>>>>>> If I merge the segments, dump them, and grep for URLs, I see the crawl
>>>>>> getting to URLs like:
>>>>>>
>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>
>>>>>> but then not grabbing the PDF once it parses that page, or adding it to
>>>>>> the outlinks: I never see a
>>>>>>
>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>
>>>>>> in the URL list.
>>>>>>
>>>>>> I'm running this command to crawl:
>>>>>>
>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>
>>>>>> Any idea what I'm doing wrong?
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
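(Side note, since this tripped me up above: both knobs can live in
conf/nutch-site.xml, which overrides nutch-default.xml; with the 1.4 crawl
command it's the copy under runtime/local/conf that actually gets read.
Roughly, with a placeholder agent name:

  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

where -1 means no truncation, so the 1.2MB PDF comes through whole.)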
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

