Hi,

What I meant to say is "the HTML of the page at the current webPage.getBaseUrl()", i.e. the URL the ParseFilter is currently processing (http://www.google.nl). Instead of returning only the HTML of the current parsing request (http://www.google.nl), webPage.getContent().array() is returning the HTML of both seed URLs, i.e. http://www.google.nl + http://www.bing.com!
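One possible explanation, offered as a guess rather than a confirmed diagnosis: java.nio.ByteBuffer.array() returns the buffer's entire backing array and ignores its position and limit. If the content buffers of several pages happen to be views over one shared array (an assumption, not verified against Nutch or Gora), array() would expose the bytes of all of them. The sketch below is plain Java with made-up page contents; ContentBytes and visibleBytes are hypothetical names, not part of any Nutch API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ContentBytes {

    // Copies only the readable window [position, limit) of the buffer.
    // Working on a duplicate leaves the caller's position untouched.
    static byte[] visibleBytes(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for the HTML of the two seed pages.
        byte[] first = "<html>google</html>".getBytes(StandardCharsets.UTF_8);
        byte[] second = "<html>bing</html>".getBytes(StandardCharsets.UTF_8);

        // ASSUMPTION: both pages end up in one shared backing array.
        byte[] shared = new byte[first.length + second.length];
        System.arraycopy(first, 0, shared, 0, first.length);
        System.arraycopy(second, 0, shared, first.length, second.length);

        // A buffer whose readable window covers only the first page.
        ByteBuffer firstPage = ByteBuffer.wrap(shared, 0, first.length);

        // array() hands back the whole shared array, i.e. both pages.
        System.out.println(new String(firstPage.array(), StandardCharsets.UTF_8));

        // Respecting position and limit yields just the first page.
        System.out.println(new String(visibleBytes(firstPage), StandardCharsets.UTF_8));
    }
}
```

If this is what is happening, reading the content through a helper like visibleBytes(webPage.getContent()) in the ParseFilter, instead of calling array() directly, should give the HTML of the current page only.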
Please see the attached Hadoop log file for details. In my ParseFilter I log the HTML of the current web page. Search for the string "INFO nutch.selector - page html is ..." and you will see two such entries in the log file (one per seed), each containing the HTML of both google.nl and bing.com.

I also find it very strange and do not know why it is happening, but it is happening on my end.

On Wed, Jun 19, 2013 at 1:51 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Tony,
>
> On Tue, Jun 18, 2013 at 11:49 AM, Tony Mullins <[email protected]> wrote:
>
> > ...instead of returning html of the current page it is returning me
> > the url of all the pages in seed.txt
>
> I suspect that this should not be happening at all!
>
> > Could you please try entering 2 or more urls in seed.txt and then
> > get webPage.getContent().array() in your ParseFilter plugin ....
> > then you will see instead of returning the html of current
> > webPage.getBaseUrl() , it is returning the html of all urls of
> > seed.txt.
>
> This does not make sense Tony. When would a call to
> page.getContent().array() return you page.getBaseUrl()?
> If you want the BaseUrl() just call getBaseUrl(). If you want page HTML go
> .getContent(), why are you involving .getBaseUrl()?

