I think I understand the problem now: the extra URLs come from the prior fetch(es). So if I want to execute a fresh crawl, I need to find some way to reset the database first, right?

Sorry if this question is too basic. This is only my fourth day with Nutch/Hadoop/HBase, though I have been a Java programmer for a while.

Thanks
-Weilei

On Wed, Jan 30, 2013 at 11:52 AM, Weilei Zhang <[email protected]> wrote:
> Hi
> I am trying to use Nutch 2.x and have one question regarding Generator
> and Injector:
> Basically, I only have one link as the root to crawl, and I see (by
> instrumenting the code) that this one link was written to Context in
> the last step of InjectorJob and that it is the only link written to
> Context from GeneratorJob. However, I saw multiple links sent to the map
> function in the first steps of GeneratorJob (I instrumented the setup
> function). Those links seem to include all URLs referenced from the
> original link. My question is: where does fetch/parse happen? From the
> Crawler code, it looks to me as if Injector is immediately
> followed by Generator; I tried to trace through the code to find out,
> but failed.
>
> I ran the crawl in the following way:
> >/nutch crawl urlsDir
>
> There is only one link, in a single file under urlsDir.
> >cat urlsDir/*
> http://www.bmw.com
>
> The following is an excerpt from the Generator map function
> instrumentation output. Those are reversed URLs.
> al.com.bmw.www:http/
> al.com.bmw.www:http/al/en
> am.bmw.www:http/
> am.bmw.www:http/am/en
> ao.co.bmw:http/
> ao.co.bmw:http/ao/pt
> ar.com.bmw.www:http/
> ar.com.bmw.www:http/ar/es/
> at.bmw.www:http/
> at.bmw.www:http/at/de/general/configurations_center/configure.html
> at.bmw.www:http/de/index.html
> at.bmw.www:http/de/topics/services-angebote/connecteddrivedienste/connecteddrive-antrag/ueberblick.html
> au.com.bmw.www:http/
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/compare.html
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/configurator.html
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/driveawayprice.html
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/financecalculator.html
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/highlights/
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/introduction.html
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requestebrochure.html
> au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requesttestdrive.html
> au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/compare.html
> au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/configurator.html
> au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/driveawayprice.html
> au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/financecalculator.html
> au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/highlights/
>
>
> Thanks for any hints!
> --
> Best Regards
> -Weilei

--
Best Regards
-Weilei
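
For anyone reading the reversed keys in the quoted map output, here is a minimal Java sketch of how such a key relates to the original URL. It only reproduces the format visible in the excerpt (host labels reversed, then ":" and the protocol, then the path); it is not Nutch's own code. The class name ReverseUrlSketch and the method reverseUrl are hypothetical names chosen for illustration, and the handling of ports, query strings, and empty paths is an assumption on my part.

```java
import java.net.URI;

public class ReverseUrlSketch {

    // Hypothetical helper: builds a key in the style seen in the map output,
    // e.g. "http://www.bmw.com.al/al/en" -> "al.com.bmw.www:http/al/en".
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }

        // Reverse the dot-separated host labels.
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        // Append the protocol, and the port only when one is given (assumption).
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }

        // Append the path; an empty path is written as "/" (assumption).
        String path = uri.getRawPath();
        key.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://www.bmw.com"));
        System.out.println(reverseUrl("http://www.bmw.com.al/al/en"));
    }
}
```

Keys of this shape keep all pages of one host (and of one domain) next to each other when the rows are sorted, which would explain why the excerpt is grouped by country domain.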

