Thanks for the answers. I'm not sure that 'http.agent.name' is the
problem, since I have set it.
This is the configuration I'm using, from nutch-1.3/conf/nutch-default.xml:
<!-- HTTP properties -->
<property>
<name>http.agent.name</name>
<value>MyFirstNutchCrawler</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
As I understand the tutorial, this should be correct.
The tutorial says: "Search for http.agent.name , and give it value
'YOURNAME Spider'"
I already had that set this way in my first email.
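
A possible alternative, as an untested sketch only: the comment at the top
of nutch-default.xml says not to modify that file directly but to copy the
entries to change into nutch-site.xml. Assuming that convention applies
here, the same property could go into nutch-site.xml, in whichever conf
directory the bin/nutch script being run actually reads. The agent name is
simply the value shown above, nothing else is changed:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: entries in nutch-site.xml override nutch-default.xml. -->
  <!-- The value is the agent name already used in nutch-default.xml above. -->
  <property>
    <name>http.agent.name</name>
    <value>MyFirstNutchCrawler</value>
  </property>
</configuration>
```

Since the Fetcher still reports the property as empty, the file edited above
may not be the one that gets loaded, so it seems worth checking which conf
directory ends up on the classpath.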
2011/7/10 Ing. Yusniel Hidalgo Delgado <[email protected]>:
> Paul, I think that your problem is related to the 'http.agent.name' property.
> Please change this property in your configuration file, as described in the
> tutorial:
>
>
>
> Good! You are almost ready to crawl. You need to give your crawler a name.
> This is required.
>
> 1. Open up $NUTCH_HOME/conf/nutch-default.xml file
> 2. Search for http.agent.name , and give it value 'YOURNAME Spider'
> 3. Optionally you may also set http.agent.url and http.agent.email properties.
>
> and try again.
>
> Greetings
>
> ----- Original message -----
> From: "Paul van Hoven" <[email protected]>
> To: [email protected]
> Sent: Sunday, July 10, 2011 7:42:47 GMT -08:00 Tijuana / Baja
> California
> Subject: Problems with tutorial
>
> I'm completely new to nutch, so I downloaded version 1.3 and worked
> through the beginners tutorial at
> http://wiki.apache.org/nutch/NutchTutorial. The first problem was that I
> did not find the file "conf/crawl-urlfilter.txt", so I omitted that step and
> continued with launching nutch. For that I created a plain text file
> in "/Users/toom/Downloads/nutch-1.3/crawled" called "urls.txt" which
> contains the following text:
>
> tom:crawled toom$ cat urls.txt
> http://nutch.apache.org/
>
> So after that I invoked nutch by calling
> tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
> solrUrl is not set, indexing will be skipped...
> crawl started in: /Users/toom/Downloads/nutch-1.3/sites
> rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
> threads = 10
> depth = 3
> solrUrl=null
> topN = 50
> Injector: starting at 2011-07-07 14:02:31
> Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-07 14:02:35, elapsed: 00:00:03
> Generator: starting at 2011-07-07 14:02:35
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> Generator: finished at 2011-07-07 14:02:39, elapsed: 00:00:04
> Fetcher: No agents listed in 'http.agent.name' property.
> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
> at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166)
> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068)
> at org.apache.nutch.crawl.Crawl.run(Crawl.java:135)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
>
>
> I do not understand what happened here. Maybe one of you can help me?
>
>
>
> --
>
>
>
> --------------------------------------------------------------------------------------------
> Ing. Yusniel Hidalgo Delgado
> Participate in COMPUMAT 2011 http://www.mfc.uclv.edu.cu/scmc
> Participate in INFO 2012 http://www.congreso-info.cu
> Universidad de las Ciencias Informáticas
> --------------------------------------------------------------------------------------------
>