*really* weird.
With 1.4, even though I have my http.agent.name property set in
conf/nutch-default.xml, it keeps telling me this when I try to crawl:
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No
agents listed in 'http.agent.name' property.
at
org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
[chipotle:local/nutch/framework] mattmann%
Is nutch-default.xml not read by the crawl command in 1.4?
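
For reference, the property block I have in mind looks like this (the value
below is just a placeholder for my real agent string):

<property>
  <name>http.agent.name</name>
  <!-- placeholder agent name; the fetcher just needs this to be non-empty -->
  <value>my-test-crawler</value>
</property>
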
Cheers,
Chris
On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
> Can you also try with trunk or 1.4? I get different output with parsechecker
> such as a proper title.
>
>
> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> contentType: application/pdf
> signature: 818fd03d7f9011b4f7000657e2aaf966
> ---------
> Url
> ---------------
> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Watergate Summary Part 02 of 02
> Outlinks: 0
> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493
> Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment;
> filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
> Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
> Cache-Control=max-age=604800
> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z
> created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter
> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-
> Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf
> creator=FBI
>
>
>
>> Hey Markus,
>>
>> I set the http.content.limit to -1, so it shouldn't have a limit.
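>>
>> (Concretely, the override I mean is along these lines; a negative value
>> disables the truncation limit:)
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <!-- -1 = no limit, so the 1.2MB PDF should not be truncated -->
>>   <value>-1</value>
>> </property>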
>>
>> I'll try injecting that single URL and fetching it with separate commands,
>> and see what happens! :-)
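>>
>> Roughly this sequence (the usual step-by-step commands; the crawldb and
>> segment paths below are just placeholders):
>>
>> # inject the single seed URL, generate one segment, fetch and parse it,
>> # then update the crawl db with the results
>> ./runtime/local/bin/nutch inject crawl/crawldb urls
>> ./runtime/local/bin/nutch generate crawl/crawldb crawl/segments -topN 1
>> s=`ls -d crawl/segments/* | tail -1`
>> ./runtime/local/bin/nutch fetch $s
>> ./runtime/local/bin/nutch parse $s
>> ./runtime/local/bin/nutch updatedb crawl/crawldb $s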
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>> Can you also check without merging segments? Or as a last resort, inject
>>> that single URL in an empty crawl db and do a single crawl cycle,
>>> preferably by using separate commands instead of the crawl command?
>>>
>>>> Hey Guys,
>>>>
>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>
>>>> http://vault.fbi.gov/
>>>>
>>>> My regex-url filter diff is:
>>>>
>>>> # accept anything else
>>>> #+.
>>>>
>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>
>>>> I'm trying to get it to parse PDFs like:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>
>>>> I see that my config ParserChecker lets me parse it OK:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> contentType: application/pdf
>>>> ---------
>>>> Url
>>>> ---------------
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> ---------
>>>> ParseData
>>>> ---------
>>>> Version: 5
>>>> Status: success(1,0)
>>>> Title:
>>>> Outlinks: 0
>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT
>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT
>>>> Content-Disposition=attachment; filename="watergat2.pdf"
>>>> Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close
>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
>>>> Cache-Control=max-age=604800
>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>
>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in
>>>> terms of the plugin.includes, as it looks like parse-tika is included
>>>> and handles the * contentType.
>>>>
>>>> If I merge the segments, dump them, and grep the dump for URLs, I see
>>>> the crawl getting to pages like:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>
>>>> but it never grabs the PDF after parsing that page, or adds it to the
>>>> outlinks, as I never see a:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>
>>>> in the URL list.
>>>>
>>>> I'm running this command to crawl:
>>>>
>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>
>>>> Any idea what I'm doing wrong?
>>>>
>>>> Cheers
>>>> Chris
>>>>
>>>>
>>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++