Off topic. We are talking about an issue with the SQL backend in GORA, not
the performance of Nutch.


Julien

On 18 October 2012 20:28, Stefan Scheffler <[email protected]> wrote:

> Hi,
> The reason Nutch is so slow is that every step runs as a Hadoop job, and
> Hadoop jobs take a long time to start. There is also a hardcoded 3-second
> delay somewhere in the Hadoop core, which makes sense in distributed
> systems but not on a single machine.
>
> Regards
> stefan
>
> On 18.10.2012 17:55, Luca Vasarelli wrote:
>
>> Hello,
>>
>> I'm using Nutch 2.1 on Linux and I'm having a problem similar to
>> http://goo.gl/nrDLV: my Nutch is fetching the same pages at each round.
>>
>> I've built a simple localhost site, with 3 pages linked to each other:
>> first.htm -> second.htm -> third.htm
>>
>> I did these steps:
>>
>> - downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
>> - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql
>> backend (thanks to [1])
>> - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql
>> configuration and adding mysql properties (thanks to [1])
>> - ran "ant runtime" from ${TEMP_NUTCH}
>> - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
>> - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name,
>> http.robots.agents and changing db.ignore.external.links to true and
>> fetcher.server.delay to 0.0
>> - created ${NUTCH_HOME}/urls/seed.txt with "http://localhost/test/first.htm"
>> inside this file
>> - created db table as [1]
>>
>> Then I ran "bin/nutch crawl urls -threads 1"
>>
>> first.htm was fetched 5 times
>> second.htm was fetched 4 times
>> third.htm was fetched 3 times
>>
>> I tried doing each step separately (inject, generate, ...) with the same
>> results.
>>
>> Also, the whole process takes about 2 minutes. Am I missing some delay
>> setting, or is this normal?
>>
>> Some extra info:
>>
>> - HTML of the pages: http://pastebin.com/dyDPJeZs
>> - Hadoop log: http://pastebin.com/rwQQPnkE
>> - nutch-site.xml: http://pastebin.com/0WArkvh5
>> - Wireshark log: http://pastebin.com/g4Bg17Ls
>> - MySQL table: http://pastebin.com/gD2SvGsy
>>
>> [1] http://nlp.solutions.asia/?p=180
>>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
