Yes.

I have noticed that sometimes, when I start a new crawl and there are
already records present in the database, the crawl does not go as expected.

I generally drop the table (in HBase) and run the crawl again.
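
For reference, a minimal sketch of that reset step. It assumes the default
Nutch 2.x HBase table name "webpage" (with a crawl id the table is usually
named "<crawlId>_webpage") and that the hbase launcher is on your PATH.
The helper name drop_commands is only illustrative; it is not part of Nutch
or HBase.

```shell
#!/bin/sh
# Print the HBase shell commands needed to drop a table.
# A table must be disabled before HBase will drop it.
# "$1" is the table name; "webpage" is assumed to be the Nutch 2.x default.
drop_commands() {
  table="$1"
  printf "disable '%s'\ndrop '%s'\n" "$table" "$table"
}
```

To actually reset the store, pipe the output into the HBase shell, for
example: drop_commands webpage | hbase shell. This deletes all crawl data in
that table, so check the table name first (the "list" command in the HBase
shell shows the existing tables).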

Also, please use the crawl script instead of the nutch script to start crawls [0].


[0] -
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554862#comment-13554862



On Wed, Jan 30, 2013 at 3:03 PM, Weilei Zhang <[email protected]> wrote:

> It seems that I understand this problem now: this comes from the prior
> fetch(es).
> I need to find some way to reset the database if I want to execute a
> fresh crawl, right?
> Sorry if this is too basic a question. This is only my 4th day into
> Nutch/Hadoop/Hbase though I have been a Java programmer for a while.
> Thanks
> -Weilei
>
>
> On Wed, Jan 30, 2013 at 11:52 AM, Weilei Zhang <[email protected]> wrote:
> > Hi
> > I am trying to use Nutch 2.x and have one question regarding Generator
> > and Injector:
> > Basically, I only have one link as the root to crawl, and I see (by
> > instrumenting the code) that this one link was written to Context in
> > the last step of InjectorJob and that is the only link written to
> > Context from GeneratorJob. However, I saw multiple links sent to the map
> > function in the first steps of GeneratorJob (I instrumented the setup
> > function). Those links seem to include all URLs referenced from the
> > original link. My question is where does fetch/parse happen? From the
> > Crawler code, it is straightforward to me that Injector is immediately
> > followed by Generator; I tried to scrub the code down to do the job
> > but failed.
> >
> > I ran crawl in the following way:
> >>/nutch  crawl urlsDir
> >
> > There is only one link under a file in urlsDir.
> >>cat urlsDir/*
> > http://www.bmw.com
> >
> > The following is an excerpt from the Generator map function
> > instrumentation output. These are reversed URLs.
> > al.com.bmw.www:http/
> > al.com.bmw.www:http/al/en
> > am.bmw.www:http/
> > am.bmw.www:http/am/en
> > ao.co.bmw:http/
> > ao.co.bmw:http/ao/pt
> > ar.com.bmw.www:http/
> > ar.com.bmw.www:http/ar/es/
> > at.bmw.www:http/
> > at.bmw.www:http/at/de/general/configurations_center/configure.html
> > at.bmw.www:http/de/index.html
> > at.bmw.www:http/de/topics/services-angebote/connecteddrivedienste/connecteddrive-antrag/ueberblick.html
> > au.com.bmw.www:http/
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/compare.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/configurator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/driveawayprice.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/financecalculator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/highlights/
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/introduction.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requestebrochure.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requesttestdrive.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/compare.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/configurator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/driveawayprice.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/financecalculator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/highlights/
> >
> >
> > Thanks for any hints!
> > --
> > Best Regards
> > -Weilei
>
>
>
> --
> Best Regards
> -Weilei
>



-- 
Kiran Chitturi
