Guys, this is a serious issue. Could anyone please help me understand why it is doing this?
I have just crawled www.google.nl and www.bing.com, and in my log file (I am logging the HTML in my ParseFilter) I can see the HTML of each URL appearing twice. How can one extract the required information in this case, when webPage.getContent().array() is returning the HTML of all the URLs in seed.txt?

Tony. :(

On Tue, Jun 18, 2013 at 2:45 PM, Tony Mullins <[email protected]> wrote:

> Hi,
>
> I have two URLs in my seed.txt: url1 and url2. When Nutch runs my
> ParseFilter plugin, I can see that I am on url1 by checking
> webPage.getBaseUrl(). At that point, if I do
>
>     String html = new String(webPage.getContent().array());
>
> it returns the HTML of both url1 and url2.
>
> And when my ParseFilter runs again for url2 (which I can see by
> checking webPage.getBaseUrl() == url2), it again returns the HTML of
> both pages (url1 and url2).
>
> Why is it doing this? How can I get the HTML of only the URL for
> which the ParseFilter is currently running?
>
> Any help would be appreciated!
>
> Thanks,
> Tony.
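[Editor's note for readers hitting the same symptom: this behavior is consistent with the page content being exposed as a java.nio.ByteBuffer whose backing array is shared across pages. ByteBuffer.array() returns the *entire* backing array, not just the slice between position() and limit() that belongs to the current page. The sketch below is illustrative only (the sliceToString helper and the sample data are made up, not Nutch API); it shows the difference between the naive extraction from the thread and a position/limit-aware one.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferSliceDemo {

    // Hypothetical helper: decode only the bytes between position and limit,
    // honoring arrayOffset(), instead of grabbing the whole backing array.
    static String sliceToString(ByteBuffer content) {
        return new String(content.array(),
                          content.arrayOffset() + content.position(),
                          content.remaining(),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate two pages stored back-to-back in one shared backing array.
        byte[] backing =
            "<html>page one</html><html>page two</html>".getBytes(StandardCharsets.UTF_8);

        // A buffer that is supposed to represent only the second page:
        // position = 21, limit = 42 (each page is 21 bytes long here).
        ByteBuffer pageTwo = ByteBuffer.wrap(backing, 21, 21);

        // Naive extraction (as in the thread) sees the whole backing array:
        String naive = new String(pageTwo.array(), StandardCharsets.UTF_8);

        // Position/limit-aware extraction sees only this page's slice:
        String correct = sliceToString(pageTwo);

        System.out.println(naive);    // both pages' HTML
        System.out.println(correct);  // <html>page two</html>
    }
}
```

If the content buffer in your setup works this way, decoding with arrayOffset() + position() and remaining() (or copying the slice out with ByteBuffer.get) should give you only the current page's HTML.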

