Lewis, I am doing the same thing, but in my ParseFilter plugin. Instead of returning the HTML of the current page, it returns the HTML of all the pages listed in seed.txt.
Could you please try entering two or more URLs in seed.txt and then call webPage.getContent().array() in your ParseFilter plugin? You will see that instead of returning the HTML of the current webPage.getBaseUrl(), it returns the HTML of all the URLs in seed.txt.

Tony

On Tue, Jun 18, 2013 at 11:18 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Please take a look at the WebTableReader [0], Tony, at around lines 408-420.
> This works perfectly for dumps of my webdb in Cassandra and should work
> well for you.
> hth
>
> [0] http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?revision=1485846&view=markup
>
> On Tue, Jun 18, 2013 at 10:50 AM, Tony Mullins <[email protected]> wrote:
>
> > Guys, this is a serious issue. Could anyone please help me understand
> > why it is doing this?
> >
> > I have just crawled www.google.nl and www.bing.com, and in my log file
> > (I am logging the HTML in my ParseFilter) I can see the HTML of each URL
> > coming twice.
> >
> > How can one extract the required information in this case, when
> > webPage.getContent().array() is returning the HTML of all URLs in
> > seed.txt?
> >
> > Tony. :(
> >
> > On Tue, Jun 18, 2013 at 2:45 PM, Tony Mullins <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have two URLs in my seed.txt, url1 and url2. When Nutch runs my
> > > ParseFilter plugin, I can see that I am in url1 by checking
> > > webPage.getBaseUrl(). At that point, if I do
> > >
> > > String html = new String(webPage.getContent().array());
> > >
> > > it returns the HTML of both url1 and url2.
> > >
> > > And when my ParseFilter is run again for url2 (I can see that by
> > > checking webPage.getBaseUrl() == url2), it again returns the HTML of
> > > both pages (url1 and url2).
> > >
> > > Why is it doing this?
> > > How do I get the HTML of only the URL for which the ParseFilter is
> > > currently running?
> > >
> > > Please, any help here!
> > >
> > > Thanks,
> > > Tony.
>
> --
> *Lewis*
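
A note on the symptom described in this thread: one possible explanation, not confirmed against the Nutch or Gora source, is that the ByteBuffer returned by webPage.getContent() is a window onto a larger backing array. java.nio.ByteBuffer.array() always returns the whole backing array and ignores the buffer's position and limit, so decoding it directly can pull in bytes that belong to other records. The sketch below uses only the JDK; the class name ContentBufferDemo, the helper readableContent, and the two-page shared array are hypothetical stand-ins for illustration, not Nutch code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentBufferDemo {

    /** Decodes only the readable window of the buffer (position up to limit). */
    public static String readableContent(ByteBuffer buf) {
        // duplicate() shares the bytes but has its own position and limit,
        // so the caller's buffer is left untouched.
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        // UTF-8 is an assumption here; a real page may declare another charset.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical setup: two pages stored back to back in one array.
        byte[] page1 = "<html>url1</html>".getBytes(StandardCharsets.UTF_8);
        byte[] page2 = "<html>url2</html>".getBytes(StandardCharsets.UTF_8);
        byte[] shared = new byte[page1.length + page2.length];
        System.arraycopy(page1, 0, shared, 0, page1.length);
        System.arraycopy(page2, 0, shared, page1.length, page2.length);

        // A buffer whose readable window covers only the first page.
        ByteBuffer content = ByteBuffer.wrap(shared, 0, page1.length);

        // array() ignores the window and exposes both pages.
        System.out.println(new String(content.array(), StandardCharsets.UTF_8));

        // Reading position..limit yields only the first page.
        System.out.println(readableContent(content));
    }
}
```

If the assumption holds, replacing new String(webPage.getContent().array()) in the ParseFilter with a bounded read like readableContent(webPage.getContent()) should return only the current page's bytes. If the output still contains both pages, the cause lies elsewhere.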

