Re: Questions/issues with nutch

Tejas Patil Wed, 26 Jun 2013 23:08:13 -0700

On Wed, Jun 26, 2013 at 10:26 PM, h b <[email protected]> wrote:

> The quick responses flowing are very encouraging. Thanks Tejas.
> Tejas, as I mentioned earlier, in fact I actually ran it step by step.
>
> So first I ran the inject command and then readdb with the dump option and
> did not see anything in the dump files, which leads me to say that the
> inject did not work. I verified the regex-urlfilter and made sure that my
> url is not getting filtered.
>
And you see nothing interesting in the logs? Oh boy... If this happened
without any config changes to the distribution (apart from http.agent.name),
then it should have been reported by now. You might set the loggers to a
lower level to get more details. My feeling is that the most likely cause is
a buggy datastore.
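
One way to double-check the filter step outside Nutch is a small sketch like
the one below. It assumes the regex-urlfilter.txt convention that each rule is
a '+' or '-' followed by a regular expression, that the first matching rule
decides, and that a url matching no rule is rejected. The rule list, the
example urls and the function name url_passes are made up for illustration,
and Python's regex dialect is not identical to Java's, so treat this as a
rough check rather than a copy of what Nutch does.

```python
import re


def url_passes(url, rule_lines):
    """Return True if the first matching +/- rule accepts the url."""
    for line in rule_lines:
        line = line.strip()
        # Blank lines and comments carry no rule.
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        # The first rule whose pattern is found in the url decides.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a url that matches no rule is rejected.
    return False


# Illustrative rules only; replace with the lines from your own filter file.
EXAMPLE_RULES = [
    "# skip file:, ftp: and mailto: urls",
    "-^(file|ftp|mailto):",
    "# skip urls containing characters that usually mark queries",
    "-[?*!@=]",
    "# accept anything else",
    "+.",
]

if __name__ == "__main__":
    for candidate in ["http://example.com/", "http://example.com/page?id=1"]:
        print(candidate, url_passes(candidate, EXAMPLE_RULES))
```

If the seed url comes back rejected here, the filter file is worth a second
look before blaming the datastore.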


> I agree that the second link is about configuring HBase as a storageDB.
> However, I do not have HBase installed and don't foresee getting it
> installed any time soon, hence using HBase for storage is not an option, so I
> am going to have to stick to Gora with memory store.
>
Ok. There were Jiras logged regarding the memory store not working correctly
(in reference to JUnit tests failing). Lewis / Renato might have more
knowledge about it. To be honest, I doubt anybody out there is actually
using memstore. HBase seems to be the most popular backend.

>
>
>
> On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]
> >wrote:
>
> > On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> >
> > > Thanks for the response Lewis.
> > > I did read these links. I mostly followed the first link and tried both
> > > the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer
> > > exception on solr, so I figured that I should first deal with getting the
> > > crawl part to work and then deal with solr indexing. Hence I went back to
> > > trying it stepwise.
> > >
> >
> > You should try running the crawl using individual commands and see where
> > the problem is. The nutch tutorial which Lewis pointed you to had those
> > commands. Peeking into the bin/crawl script would also help, as it
> > calls the nutch commands.
> >
> > >
> > > As for the second link, it is more about using HBase as the store
> > > instead of gora. This is not really an option for me yet, because my grid
> > > does not have hbase installed yet. Getting it done is not much under my
> > > control.
> > >
> >
> > HBase is one of the datastores supported by Apache Gora. That tutorial
> > speaks about how to configure Nutch (actually Gora) to use HBase as a
> > backend. So, it's wrong to say that the tutorial was about HBase and not
> > Gora.
> >
> > >
> > > The FAQ link is the one I had not gone through until I checked your
> > > response, but I do not find answers to any of my questions
> > > (directly/indirectly) in it.
> > >
> >
> > Ok
> >
> > >
> > >
> > >
> > >
> > > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> > > [email protected]> wrote:
> > >
> > > > Hi Hemant,
> > > > > I strongly advise you to take some time to look through the Nutch
> > > > > Tutorial for 1.x and 2.x.
> > > > http://wiki.apache.org/nutch/NutchTutorial
> > > > http://wiki.apache.org/nutch/Nutch2Tutorial
> > > > > Also please see the FAQs, which you will find very useful.
> > > > http://wiki.apache.org/nutch/FAQ
> > > >
> > > > Thanks
> > > > Lewis
> > > >
> > > >
> > > > On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > > > I am a first-time user of nutch. I installed
> > > > > nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a single
> > > > > webpage.
> > > > >
> > > > > > I am running nutch step by step. These are the problems I came
> > > > > > across -
> > > > >
> > > > > > 1. Inject did not work, i.e. the url is not reflected in the
> > > > > > webdb (gora-memstore). The way I verify this is that after running
> > > > > > inject, I run readdb with dump. This created a directory in hdfs
> > > > > > with a 0 size part file.
> > > > >
> > > > > > 2. config files - This confused me a lot. When run from the deploy
> > > > > > directory, does nutch use the config files from local/conf? Changes
> > > > > > made to local/conf/nutch-site.xml did not take effect after editing
> > > > > > this file. I had to edit this in order to get rid of the
> > > > > > 'http.agent.name' error. I finally ended up hard-coding this in the
> > > > > > code, rebuilding and running to keep going forward.
> > > > >
> > > > > > 3. how to interpret readdb - Running readdb -stats shows a lot of
> > > > > > output, but I do not see my url from seed.txt in there. So I do not
> > > > > > know if the entry in webdb actually reflects my seed.txt at all or
> > > > > > not.
> > > > >
> > > > > > 4. logs - When nutch is run from the deploy directory, the
> > > > > > logs/hadoop.log is not generated anymore, neither locally nor on the
> > > > > > grid. I tried to make it verbose by changing log4j.properties to
> > > > > > DEBUG, but still had no file generated.
> > > > >
> > > > > Any help with this would help me move forward with nutch.
> > > > >
> > > > > Regards
> > > > > Hemant
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Lewis*
> > > >
> > >
> >
>
