Guys, this is a serious issue. Could anyone please help me understand why it is doing this?
I have just crawled www.google.nl and www.bing.com, and in my log file (I am logging the HTML in my ParseFilter) I can see the HTML of each URL appearing twice. How can one extract the required information in this case, when webPage.getContent().array() is returning the HTML of all the URLs in seed.txt?

Tony. :(

On Tue, Jun 18, 2013 at 2:45 PM, Tony Mullins <[email protected]> wrote:

> Hi,
>
> I have two URLs in my seed.txt: url1 and url2. When Nutch runs my
> ParseFilter plugin, I can see that I am on url1 by checking
> webPage.getBaseUrl(). At that point, if I do
>
>     String html = new String(webPage.getContent().array());
>
> it returns the HTML of both url1 and url2.
>
> And when my ParseFilter runs again for url2 (which I can see by
> checking webPage.getBaseUrl() == url2), it again returns the HTML of
> both pages (url1 and url2).
>
> Why is it doing this? How can I get the HTML of only the URL for
> which the ParseFilter is currently running?
>
> Any help would be appreciated!
>
> Thanks,
> Tony.
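[Editor's note for readers hitting the same symptom: this behavior is consistent with the page content being exposed as a java.nio.ByteBuffer whose backing array is shared across pages. ByteBuffer.array() returns the *entire* backing array, not just the slice between position() and limit() that belongs to the current page. The sketch below is illustrative only (the sliceToString helper and the sample data are made up, not Nutch API); it shows the difference between the naive extraction from the thread and a position/limit-aware one.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferSliceDemo {

    // Hypothetical helper: decode only the bytes between position and limit,
    // honoring arrayOffset(), instead of grabbing the whole backing array.
    static String sliceToString(ByteBuffer content) {
        return new String(content.array(),
                          content.arrayOffset() + content.position(),
                          content.remaining(),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate two pages stored back-to-back in one shared backing array.
        byte[] backing =
            "<html>page one</html><html>page two</html>".getBytes(StandardCharsets.UTF_8);

        // A buffer that is supposed to represent only the second page:
        // position = 21, limit = 42 (each page is 21 bytes long here).
        ByteBuffer pageTwo = ByteBuffer.wrap(backing, 21, 21);

        // Naive extraction (as in the thread) sees the whole backing array:
        String naive = new String(pageTwo.array(), StandardCharsets.UTF_8);

        // Position/limit-aware extraction sees only this page's slice:
        String correct = sliceToString(pageTwo);

        System.out.println(naive);    // both pages' HTML
        System.out.println(correct);  // <html>page two</html>
    }
}
```

If the content buffer in your setup works this way, decoding with arrayOffset() + position() and remaining() (or copying the slice out with ByteBuffer.get) should give you only the current page's HTML.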

