OK, never mind. This *is* different behavior from 1.3, apparently, but I figured out how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
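For anyone who hits the same thing, the two overrides in question look roughly like this in runtime/local/conf/nutch-default.xml (a sketch; the agent name value below is only a placeholder, not the one I actually used, and any non-empty name satisfies the Fetcher check):

<property>
  <name>http.agent.name</name>
  <!-- placeholder value; set this to your own crawler's name -->
  <value>my-nutch-crawler</value>
</property>

<property>
  <name>http.content.limit</name>
  <!-- -1 disables truncation, so the ~1.2MB PDFs aren't cut off -->
  <value>-1</value>
</property>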
I'll report back on whether I'm able to grab the PDFs using 1.4...

Cheers,
Chris

On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:

> *really* weird.
>
> With 1.4, even though I have my http.agent.name property set in
> conf/nutch-default.xml, it keeps telling me this when I try to crawl:
>
> Fetcher: No agents listed in 'http.agent.name' property.
> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> [chipotle:local/nutch/framework] mattmann%
>
> Is nutch-default.xml not read by the crawl command in 1.4?
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>
>> Can you also try with trunk or 1.4? I get different output with
>> parsechecker, such as a proper title.
>>
>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> contentType: application/pdf
>> signature: 818fd03d7f9011b4f7000657e2aaf966
>> ---------
>> Url
>> ---------------
>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: Watergate Summary Part 02 of 02
>> Outlinks: 0
>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI
>>
>>> Hey Markus,
>>>
>>> I set the http.content.limit to -1, so it shouldn't have a limit.
>>>
>>> I'll try injecting that single URL and see if I can get it to download
>>> using separate commands, and see what happens! :-)
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>
>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>>> Can you also check without merging segments? Or, as a last resort, inject
>>>> that single URL into an empty crawl db and do a single crawl cycle,
>>>> preferably using separate commands instead of the crawl command?
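For the archives, the separate-command cycle Markus suggests above would look roughly like this (a sketch against an empty crawl db; the urls seed directory and the crawl/ paths are just from my local layout):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1
# pick up the segment that generate just created
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

Running the steps separately like this makes it much easier to see which stage (fetch, parse, or db update) is losing the PDF.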
>>>>> Hey Guys,
>>>>>
>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>
>>>>> http://vault.fbi.gov/
>>>>>
>>>>> My regex-urlfilter diff is:
>>>>>
>>>>> # accept anything else
>>>>> #+.
>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>
>>>>> I'm trying to get it to parse PDFs like:
>>>>>
>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>
>>>>> I see that ParserChecker, with my config, parses it OK:
>>>>>
>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> contentType: application/pdf
>>>>> ---------
>>>>> Url
>>>>> ---------------
>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> ---------
>>>>> ParseData
>>>>> ---------
>>>>> Version: 5
>>>>> Status: success(1,0)
>>>>> Title:
>>>>> Outlinks: 0
>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>
>>>>> I didn't change conf/parse-plugins.xml or the plugin.includes in
>>>>> conf/nutch-default.xml, since parse-tika is already included and
>>>>> handles the * contentType.
>>>>>
>>>>> If I merge the segments, dump them, and grep the dump for URLs, I see
>>>>> the crawl reaching pages like:
>>>>>
>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>
>>>>> but it never grabs the PDF when it parses that page, and never adds it
>>>>> to the outlinks: I never see
>>>>>
>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>
>>>>> in the URL list.
>>>>>
>>>>> I'm running this command to crawl:
>>>>>
>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>
>>>>> Any idea what I'm doing wrong?
>>>>>
>>>>> Cheers,
>>>>> Chris
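P.S. The merge-and-dump check described in the quoted message above was roughly this sequence (a sketch; the directory names are just from my layout, and if memory serves, readseg writes its text output to a file named "dump" in the output directory):

bin/nutch mergesegs crawl/merged -dir crawl/segments
# mergesegs creates a single timestamped segment under the output dir
seg=`ls -d crawl/merged/* | head -1`
bin/nutch readseg -dump $seg crawl/dump
grep at_download crawl/dump/dump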
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

