Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

Tony Mullins Tue, 18 Jun 2013 22:39:54 -0700

Hi,

What I meant is this: when the ParseFilter is handling the current url,
i.e. when webPage.getBaseUrl() is http://www.google.nl,
webPage.getContent().array() does not return only the html of that page.
Instead it returns the html of both seeds (urls), i.e.
http://www.google.nl + http://www.bing.com!

Please see the attached hadoop log file for details. In my ParseFilter I
log the html of the current webpage. Look for the string
"INFO  nutch.selector - page html is ..."; you will see two such strings
in the log file (one per seed), and each contains the html contents of
both google.nl and bing.com.

It's very strange to me as well that this is happening, but it is
happening on my end.
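
One possible explanation, which I have not confirmed against the Nutch
source: java.nio.ByteBuffer.array() returns the whole backing array and
ignores the buffer's position and limit. If the content buffer is only a
window onto a larger shared array, array() would expose more than the
current page. The sketch below uses plain JDK classes only; the class name
ContentBufferDemo, the helper remainingBytes and the sample html strings
are made up for illustration and are not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ContentBufferDemo {

    /** Copies only the bytes between position and limit of the buffer. */
    static byte[] remainingBytes(ByteBuffer buf) {
        // duplicate() shares the content but has its own position/limit,
        // so reading from the copy does not disturb the caller's buffer.
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the html of the two seeds.
        String first = "<html>google</html>";
        String second = "<html>bing</html>";
        byte[] shared = (first + second).getBytes(StandardCharsets.UTF_8);

        // A buffer that covers only the first page of the shared array.
        ByteBuffer page = ByteBuffer.wrap(shared, 0, first.length());

        // array() hands back the entire backing array: both pages.
        System.out.println(new String(page.array(), StandardCharsets.UTF_8));

        // Reading position..limit yields only the first page.
        System.out.println(
            new String(remainingBytes(page), StandardCharsets.UTF_8));
    }
}
```

If this assumption holds for the buffer returned by webPage.getContent(),
copying the remaining bytes as above, instead of calling array(), should
give the html of the current page only.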

On Wed, Jun 19, 2013 at 1:51 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Tony,
>
> On Tue, Jun 18, 2013 at 11:49 AM, Tony Mullins <[email protected]
> >wrote:
>
> > ...instead
> > of returning html of the current page it is returning me the url of all
> > the pages in seed.txt
> >
>
> I suspect that this should not be happening at all!
>
>
> >
> > Could you please try entering 2 or more urls in seed.txt and then
> > get webPage.getContent().array()
> > in your ParseFilter plugin .... then you will see instead of returning
> > the html of current webPage.getBaseUrl() , it is returning the html of
> > all urls of seed.txt.
> >
>
> This does not make sense Tony. When would a call to
> page.getContent().array() return you page.getBaseUrl()?
> If you want the BaseUrl() just call getBaseUrl(). If you want page HTML go
> .getContent(), why are you involving .getBaseUrl()?
>
