Hi Lewis,

Thanks for the details. One quick question: with memstore as the datastore, will the results be persisted across runs? That is, after injecting, where would the crawl datums get stored on disk so that the generate phase can pick them up? I believe that memstore won't do this and will give up everything once the process ends.
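For reference, the backend is selected in conf/gora.properties. A minimal sketch of the two relevant settings (class names taken from Gora's MemStore and HBase modules; double-check against the gora-*.jar shipped with your Nutch build):

```properties
# conf/gora.properties
# Default (volatile, in-memory only -- nothing survives the JVM exit):
gora.datastore.default=org.apache.gora.memory.store.MemStore

# Persistent alternative, requires a running HBase cluster and the
# gora-hbase dependency enabled in ivy/ivy.xml:
# gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```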
On Wed, Jun 26, 2013 at 11:06 PM, Tejas Patil <[email protected]> wrote:

> On Wed, Jun 26, 2013 at 10:26 PM, h b <[email protected]> wrote:
>
>> The quick responses flowing are very encouraging. Thanks Tejas.
>> Tejas, as I mentioned earlier, I did in fact run it step by step.
>>
>> So first I ran the inject command and then readdb with the dump option,
>> and did not see anything in the dump files; that leads me to say that the
>> inject did not work. I verified the regex-urlfilter and made sure that my
>> url is not getting filtered.
>>
> ... and you see nothing interesting in the logs. Oh boy... If this happens
> w/o any config changes over the distribution (apart from http.agent.name),
> then it should have been reported by now. You might set the loggers to a
> lower level to get more details. I have a feeling that the reason is most
> likely that the datastore used is buggy.
>
>> I agree that the second link is about configuring HBase as a storage DB.
>> However, I do not have HBase installed and don't foresee getting it
>> installed any time soon, hence using HBase for storage is not an option,
>> so I am going to have to stick to Gora with the memory store.
>>
> Ok. There were JIRAs logged regarding the memory store not working
> correctly (in reference to JUnit tests failing). Lewis / Renato might have
> more knowledge about it. Being honest, I doubt anybody out there is
> actually using memstore. HBase seems to be the most cheered backend.
>
>> On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
>>
>>> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>>>
>>>> Thanks for the response Lewis.
>>>> I did read these links; I mostly followed the first link and tried both
>>>> the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer
>>>> exception on Solr, so I figured that I should first deal with getting
>>>> the crawl part to work and then deal with Solr indexing. Hence I went
>>>> back to trying it stepwise.
>>>>
>>> You should try running the crawl using individual commands and see where
>>> the problem is. The Nutch tutorial which Lewis pointed you to has those
>>> commands. Even peeking into the bin/crawl script would also help, as it
>>> calls the nutch commands.
>>>
>>>> As for the second link, it is more about using HBase as the store
>>>> instead of Gora. This is not really an option for me yet, because my
>>>> grid does not have HBase installed yet. Getting it installed is not
>>>> really under my control.
>>>>
>>> HBase is one of the datastores supported by Apache Gora. That tutorial
>>> speaks about how to configure Nutch (actually Gora) to use HBase as a
>>> backend. So, it's wrong to say that the tutorial was about HBase and not
>>> Gora.
>>>
>>>> The FAQ link is the one I had not gone through until I checked your
>>>> response, but I do not find answers to any of my questions
>>>> (directly/indirectly) in it.
>>>>
>>> Ok
>>>
>>>> On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Hemant,
>>>>> I strongly advise you to take some time to look through the Nutch
>>>>> Tutorial for 1.x and 2.x.
>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>> http://wiki.apache.org/nutch/Nutch2Tutorial
>>>>> Also please see the FAQs, which you will find very useful.
>>>>> http://wiki.apache.org/nutch/FAQ
>>>>>
>>>>> Thanks
>>>>> Lewis
>>>>>
>>>>> On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am a first-time user of Nutch. I installed nutch (2.2), solr (4.3),
>>>>>> and hadoop (0.20) and got started crawling a single webpage.
>>>>>>
>>>>>> I am running Nutch step by step. These are the problems I came
>>>>>> across:
>>>>>>
>>>>>> 1. Inject did not work, i.e. the url does not show up in the webdb
>>>>>> (gora-memstore). The way I verify this is that after running inject,
>>>>>> I run readdb with dump. This created a directory in HDFS with a
>>>>>> zero-size part file.
>>>>>>
>>>>>> 2. Config files: this confused me a lot. When run from the deploy
>>>>>> directory, does nutch use the config files from local/conf? Changes
>>>>>> made to local/conf/nutch-site.xml did not take effect after editing
>>>>>> this file. I had to edit it in order to get rid of the
>>>>>> 'http.agent.name' error. I finally ended up hard-coding this in the
>>>>>> code, rebuilding, and running to keep going forward.
>>>>>>
>>>>>> 3. How to interpret readdb: running readdb -stats shows a lot of
>>>>>> output, but I do not see my url from seed.txt in there. So I do not
>>>>>> know whether the entry in the webdb actually reflects my seed.txt at
>>>>>> all.
>>>>>>
>>>>>> 4. Logs: when nutch is run from the deploy directory, logs/hadoop.log
>>>>>> is not generated anymore, neither locally nor on the grid. I tried to
>>>>>> make it verbose by changing log4j.properties to DEBUG, but still no
>>>>>> file was generated.
>>>>>>
>>>>>> Any help with this would help me move forward with nutch.
>>>>>>
>>>>>> Regards
>>>>>> Hemant
>>>>>
>>>>> --
>>>>> *Lewis*
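For anyone following along, the step-by-step cycle discussed above looks roughly like this. This is a sketch based on the 2.x tutorial; the crawl id "test1" and the "urls/" seed directory are example names, and the exact flags may differ between releases, so check the bin/crawl script in your own distribution for the authoritative invocation:

```shell
# Run from the runtime/local (or deploy) directory of a Nutch 2.x build.
# Assumes urls/seed.txt contains the seed URL and http.agent.name is set
# in conf/nutch-site.xml.
bin/nutch inject urls -crawlId test1
bin/nutch generate -topN 10 -crawlId test1
bin/nutch fetch -all -crawlId test1
bin/nutch parse -all -crawlId test1
bin/nutch updatedb -crawlId test1

# Inspect what actually landed in the web table after each step:
bin/nutch readdb -stats -crawlId test1
bin/nutch readdb -dump dump_dir -crawlId test1
```

Running readdb immediately after inject is also how to confirm whether the datastore kept anything; with MemStore, each bin/nutch invocation is a separate JVM, so data written by one command would not be visible to the next.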

