Please take a look at WebTableReader [0], Tony, at around lines 408-420. This works perfectly for dumps of my webdb in Cassandra and should work well for you. HTH
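The likely cause is that webPage.getContent() returns a ByteBuffer whose backing array can be larger than (and shared beyond) the current record, so new String(buffer.array()) picks up neighbouring data too. A minimal sketch of the safe pattern, reading only position..limit of the buffer (the class name and sample data below are mine for illustration, not from Nutch):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentExtract {

    // Extract only this page's bytes from the content ByteBuffer.
    // buffer.array() returns the whole backing array, which may be shared
    // across records -- hence the HTML of both URLs showing up. Respecting
    // position() and remaining() yields just the current record's slice.
    static String toHtml(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes); // duplicate() leaves the caller's position untouched
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a backing array holding two pages' content, where this
        // record's slice is only the second half (hypothetical sample data).
        byte[] backing = "<html>page1</html><html>page2</html>"
                .getBytes(StandardCharsets.UTF_8);
        ByteBuffer content = ByteBuffer.wrap(backing, 18, 18);
        System.out.println(toHtml(content)); // prints only page2's markup
    }
}
```

In a ParseFilter you would pass webPage.getContent() into a helper like toHtml() instead of calling array() directly.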
[0] http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?revision=1485846&view=markup

On Tue, Jun 18, 2013 at 10:50 AM, Tony Mullins <[email protected]> wrote:

> Guyz, it's a serious issue, could anyone plz help me here: why is it
> doing so?
>
> I have just crawled www.google.nl and www.bing.com, and in my log file (I
> am logging the html in my ParseFilter) I can see the html of each url
> coming twice.
>
> How can one extract his/her required information in this case when
> webPage.getContent().array() is returning the html of all urls in
> seed.txt!!!
>
> Tony. :(
>
> On Tue, Jun 18, 2013 at 2:45 PM, Tony Mullins <[email protected]> wrote:
>
> > Hi,
> >
> > I have two urls in my seed.txt, url1 & url2. When Nutch runs my
> > ParseFilter plugin, I can see that I am in url1 by checking
> > webPage.getBaseUrl(). At that point, if I do
> >
> > String html = new String(webPage.getContent().array());
> >
> > it returns me the html of both url1 & url2.
> >
> > And when my ParseFilter is run again for url2 (I can see that by
> > checking webPage.getBaseUrl() == url2), it again returns me the html
> > of both pages (url1 & url2)...
> >
> > Why is it doing so?
> > How do I get the html of only the url for which the ParseFilter is
> > currently running?
> >
> > Please, any help here!
> >
> > Thanks,
> > Tony.

--
*Lewis*

