Yes. I have noticed that when I want a fresh crawl and there are already records in the database, the crawl sometimes does not go as expected.
I generally drop the table (HBase) and run the crawl again. Also, please use the crawl script instead of the nutch script to start crawls [0].

[0] - https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554862#comment-13554862

On Wed, Jan 30, 2013 at 3:03 PM, Weilei Zhang <[email protected]> wrote:
> It seems that I understand this problem now: this comes from the prior
> fetch(es).
> I need to find some way to reset the database if I want to execute a
> fresh crawl, right?
> Sorry if this is too basic a question. This is only my 4th day into
> Nutch/Hadoop/Hbase though I have been a Java programmer for a while.
> Thanks
> -Weilei
>
>
> On Wed, Jan 30, 2013 at 11:52 AM, Weilei Zhang <[email protected]> wrote:
> > Hi
> > I am trying to use Nutch 2.x and have one question regarding Generator
> > and Injector:
> > Basically, I only have link as root to crawl and I see (by
> > instrumenting the code) that this one link was written to Context in
> > the last step of InjectorJob and that is the only link written to
> > Context from GeneratorJob. However, I saw multiple links sent to map
> > function in the first steps of GeneratorJob ( I instrumented setup
> > function). Those links seem to include all URLs referenced from the
> > original link. My question is where does fetch/parse happen? From the
> > Crawler code, it is straightforward to me that Injector is immediately
> > followed by Generator; I tried to scrub the code down to do the job
> > but failed.
> >
> > I ran crawl in the following way:
> >>/nutch crawl urlsDir
> >
> > There is only one link under a file in urlsDir.
> >>cat urlsDir/*
> > http://www.bmw.com
> >
> > The following is excerpt from the Generator map function
> > instrumentation output. Those are reversedURL.
> > al.com.bmw.www:http/
> > al.com.bmw.www:http/al/en
> > am.bmw.www:http/
> > am.bmw.www:http/am/en
> > ao.co.bmw:http/
> > ao.co.bmw:http/ao/pt
> > ar.com.bmw.www:http/
> > ar.com.bmw.www:http/ar/es/
> > at.bmw.www:http/
> > at.bmw.www:http/at/de/general/configurations_center/configure.html
> > at.bmw.www:http/de/index.html
> > at.bmw.www:http/de/topics/services-angebote/connecteddrivedienste/connecteddrive-antrag/ueberblick.html
> > au.com.bmw.www:http/
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/compare.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/configurator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/driveawayprice.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/financecalculator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/highlights/
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/introduction.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requestebrochure.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requesttestdrive.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/compare.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/configurator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/driveawayprice.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/financecalculator.html
> > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/highlights/
> >
> >
> > Thanks for any hints!
> > --
> > Best Regards
> > -Weilei
> >
>
> --
> Best Regards
> -Weilei
>

--
Kiran Chitturi
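
P.S. For reading the reversed URLs in the quoted output: the keys look like the host with its dot-separated parts in reverse order, followed by ":" and the scheme, then the path. Below is a minimal Python sketch of that mapping, inferred from the quoted lines only. The function name reverse_url is hypothetical, the handling of ports and query strings is an assumption, and this is not Nutch's own code, so treat it as an illustration rather than the exact key format.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a reversed-host key such as 'al.com.bmw.www:http/al/en'.

    Hypothetical helper; the layout is inferred from the quoted output.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # www.bmw.com.al -> al.com.bmw.www
    reversed_host = ".".join(reversed(host.split(".")))
    key = reversed_host + ":" + parts.scheme
    # Assumption: an explicit port follows the scheme after another ':'
    if parts.port is not None:
        key += ":" + str(parts.port)
    # Simplification: an empty path is written as '/'
    key += parts.path or "/"
    # Assumption: the query string is kept as part of the key
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    print(reverse_url("http://www.bmw.com.al/al/en"))
```

With keys built this way, all pages of one host (and of one domain) sort next to each other in the table, which is why the output above is grouped by country site.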

