*really* weird.
With 1.4, even though I have my http.agent.name property set in
conf/nutch-default.xml, it keeps telling me this when I try to crawl:
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No
agents listed in 'http.agent.name' property.
at
org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
[chipotle:local/nutch/framework] mattmann%
Is nutch-default.xml not read by the crawl command in 1.4?
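
For reference, the property block I have in mind looks like this (the value
below is just a placeholder for my real agent string):

<property>
  <name>http.agent.name</name>
  <!-- placeholder agent name; the fetcher just needs this to be non-empty -->
  <value>my-test-crawler</value>
</property>
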
Cheers,
Chris
On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
> Can you also try with trunk or 1.4? I get different output with parsechecker
> such as a proper title.
>
>
> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> contentType: application/pdf
> signature: 818fd03d7f9011b4f7000657e2aaf966
> ---------
> Url
> ---------------
> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Watergate Summary Part 02 of 02
> Outlinks: 0
> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493
> Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment;
> filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
> Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
> Cache-Control=max-age=604800
> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z
> created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter
> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-
> Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf
> creator=FBI
>
>
>
>> Hey Markus,
>>
>> I set the http.content.limit to -1, so it shouldn't have a limit.
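>>
>> (Concretely, the override I mean is along these lines; a negative value
>> disables the truncation limit:)
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <!-- -1 = no limit, so the 1.2MB PDF should not be truncated -->
>>   <value>-1</value>
>> </property>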
>>
>> I'll try injecting that single URL and fetching it with separate commands,
>> and see what happens! :-)
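>>
>> Roughly this sequence (the usual step-by-step commands; the crawldb and
>> segment paths below are just placeholders):
>>
>> # inject the single seed URL, generate one segment, fetch and parse it,
>> # then update the crawl db with the results
>> ./runtime/local/bin/nutch inject crawl/crawldb urls
>> ./runtime/local/bin/nutch generate crawl/crawldb crawl/segments -topN 1
>> s=`ls -d crawl/segments/* | tail -1`
>> ./runtime/local/bin/nutch fetch $s
>> ./runtime/local/bin/nutch parse $s
>> ./runtime/local/bin/nutch updatedb crawl/crawldb $s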
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>> Can you also check without merging segments? Or as a last resort, inject
>>> that single URL in an empty crawl db and do a single crawl cycle,
>>> preferably by using separate commands instead of the crawl command?
>>>
>>>> Hey Guys,
>>>>
>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>
>>>> http://vault.fbi.gov/
>>>>
>>>> My regex-url filter diff is:
>>>>
>>>> # accept anything else
>>>> #+.
>>>>
>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>
>>>> I'm trying to get it to parse PDFs like:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>
>>>> I see that my config ParserChecker lets me parse it OK:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> contentType: application/pdf
>>>> ---------
>>>> Url
>>>> ---------------
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>> ---------
>>>> ParseData
>>>> ---------
>>>> Version: 5
>>>> Status: success(1,0)
>>>> Title:
>>>> Outlinks: 0
>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT
>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT
>>>> Content-Disposition=attachment; filename="watergat2.pdf"
>>>> Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close
>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
>>>> Cache-Control=max-age=604800
>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>
>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in
>>>> terms of the plugin.includes, as it looks like parse-tika is included
>>>> and handles the * contentType.
>>>>
>>>> If I merge the segments, dump them, and grep the dump for URLs, I see
>>>> the crawl getting to pages like:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>
>>>> but it never grabs the PDF after parsing that page, or adds it to the
>>>> outlinks, as I never see a:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>
>>>> in the URL list.
>>>>
>>>> I'm running this command to crawl:
>>>>
>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>
>>>> Any idea what I'm doing wrong?
>>>>
>>>> Cheers
>>>> Chris
>>>>
>>>>
>>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++