Re: Questions/issues with nutch

h b Wed, 26 Jun 2013 21:54:11 -0700

Thanks for the response, Lewis.
I did read these links. I mostly followed the first one and tried both
sections 3.2 and 3.3. Using bin/crawl gave me a null pointer exception in
Solr, so I figured I should first get the crawl part working and then deal
with Solr indexing. Hence I went back to running it step by step.


As for the second link, it is more about using HBase as the store instead of
Gora. This is not really an option for me yet, because my grid does not have
HBase installed. Getting it installed is not much under my control.

The FAQ link is the one I had not gone through until I saw your response,
but I do not find answers to any of my questions (directly or indirectly)
in it.




On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Hemant,
> I strongly advise you to take some time to look through the Nutch Tutorial
> for 1.x and 2.x.
> http://wiki.apache.org/nutch/NutchTutorial
> http://wiki.apache.org/nutch/Nutch2Tutorial
> Also please see the FAQ, which you will find very useful.
> http://wiki.apache.org/nutch/FAQ
>
> Thanks
> Lewis
>
>
> On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
>
> > Hi,
> > I am a first-time user of Nutch. I installed
> > nutch(2.2)/solr(4.3)/hadoop(0.20) and started by crawling a single
> > webpage.
> >
> > I am running nutch step by step. These are the problems I came across -
> >
> > 1. Inject did not work, i.e. the URL does not show up in the
> > webdb (gora-memstore). The way I verify this is that after running
> > inject, I run readdb with dump. This created a directory in HDFS with a
> > 0-size part file.
> >
> > 2. config files - This confused me a lot. When run from deploy directory,
> > does nutch use the config files from local/conf? Changes made to
> > local/conf/nutch-site.xml did not take effect after editing this file. I
> > had to edit this in order to get rid of the 'http.agent.name' error. I
> > finally ended up hard-coding this in the code, rebuilding and running to
> > keep going forward.
> >
> > 3. How to interpret readdb - Running readdb -stats shows a lot of
> > output, but I do not see my URL from seed.txt in there. So I do not know
> > whether the entry in webdb actually reflects my seed.txt or not.
> >
> > 4. Logs - When Nutch is run from the deploy directory, logs/hadoop.log
> > is no longer generated, neither locally nor on the grid. I tried to make
> > it verbose by changing log4j.properties to DEBUG, but still no file was
> > generated.
> >
> > Any help with this would help me move forward with nutch.
> >
> > Regards
> > Hemant
> >
>
>
>
> --
> *Lewis*
>
