Hi Hemant,
I strongly advise you to take some time to look through the Nutch tutorials
for 1.x and 2.x:
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Nutch2Tutorial
Also, please see the FAQ, which you will find very useful:
http://wiki.apache.org/nutch/FAQ

Thanks
Lewis


On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:

> Hi,
> I am a first-time user of Nutch. I installed
> nutch(2.2)/solr(4.3)/hadoop(0.20) and started by crawling a single
> webpage.
>
> I am running nutch step by step. These are the problems I came across -
>
> 1. Inject did not work, i.e. the url is not reflected in the
> webdb(gora-memstore). The way I verify this is, after running inject, I run
> readdb with dump. This created a directory in hdfs with a 0-size part file.
>
> 2. config files - This confused me a lot. When run from the deploy
> directory, does nutch use the config files from local/conf? Changes I made
> to local/conf/nutch-site.xml did not take effect. I had to edit this file
> in order to get rid of the 'http.agent.name' error. I finally ended up
> hard-coding the value in the code, rebuilding, and running to keep going
> forward.
>
> 3. how to interpret readdb - Running readdb -stats shows a lot of output,
> but I do not see my url from seed.txt in there. So I do not know whether
> the entry in webdb actually reflects my seed.txt or not.
>
> 4. logs - When nutch is run from the deploy directory, logs/hadoop.log
> is no longer generated, neither locally nor on the grid. I tried to make
> logging verbose by changing log4j.properties to DEBUG, but still no file
> was generated.
>
> Any help with this would help me move forward with nutch.
>
> Regards
> Hemant
>



-- 
*Lewis*
