Lewis, I am doing the same thing, but in my ParseFilter plugin. Instead of
returning the HTML of the current page, it returns the HTML of all the
pages listed in seed.txt.

Could you please try entering two or more URLs in seed.txt and then call
webPage.getContent().array()
in your ParseFilter plugin? You will see that instead of returning the
HTML of the current webPage.getBaseUrl(), it returns the HTML of all the
URLs in seed.txt.
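
One possible explanation, offered as an assumption rather than something
confirmed in this thread: java.nio.ByteBuffer.array() returns the whole
backing array and ignores the buffer's position and limit. If the buffer
returned by webPage.getContent() is only a window onto a larger shared
array, array() would expose the bytes of other pages too. The sketch below
reproduces that effect with plain ByteBuffers (no Nutch classes) and reads
only the readable window. The class name, the helper name readableContent
and the UTF-8 charset are hypothetical choices made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class PageContentDemo {

    // Copies only the bytes between position and limit. A duplicate is used
    // so the caller's buffer position is left untouched.
    // Assumption: the content is UTF-8 encoded.
    static String readableContent(ByteBuffer content) {
        ByteBuffer view = content.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String first = "<html>url1</html>";
        String second = "<html>url2</html>";
        // One shared backing array that holds the content of both pages.
        byte[] shared = (first + second).getBytes(StandardCharsets.UTF_8);

        // Each buffer is a window onto its own slice of the shared array.
        ByteBuffer page1 = ByteBuffer.wrap(shared, 0, first.length());
        ByteBuffer page2 = ByteBuffer.wrap(shared, first.length(), second.length());

        // array() ignores the window and yields both pages.
        System.out.println(new String(page1.array(), StandardCharsets.UTF_8));

        // Reading only the window yields a single page each time.
        System.out.println(readableContent(page1));
        System.out.println(readableContent(page2));
    }
}
```

If this assumption holds for your setup, passing webPage.getContent() to a
helper like this one, in place of calling array() directly, should return
only the current page.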

Tony


On Tue, Jun 18, 2013 at 11:18 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Tony, please take a look at the WebTableReader [0], at around lines
> 408-420.
> This works perfectly for dumps of my webdb in Cassandra and should work
> well for you.
> hth
>
> [0]
>
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?revision=1485846&view=markup
>
>
> On Tue, Jun 18, 2013 at 10:50 AM, Tony Mullins <[email protected]
> >wrote:
>
> > Guys, this is a serious issue. Could anyone please help me understand why
> > it is doing this?
> >
> > I have just crawled www.google.nl and www.bing.com, and in my log file
> > (I am logging the HTML in my ParseFilter) I can see the HTML of each URL
> > appearing twice.
> >
> > How can one extract the required information in this case, when
> > webPage.getContent().array() is returning the HTML of all URLs in
> > seed.txt?
> >
> > Tony. :(
> >
> >
> >
> > On Tue, Jun 18, 2013 at 2:45 PM, Tony Mullins <[email protected]
> > >wrote:
> >
> > > Hi,
> > >
> > > I have two URLs in my seed.txt, url1 and url2. When Nutch runs my
> > > ParseFilter plugin, I can see that I am in url1 by checking
> > > webPage.getBaseUrl(). At that point, if I do
> > > String html = new String(webPage.getContent().array());
> > >
> > > it returns the HTML of both url1 and url2.
> > >
> > > And when my ParseFilter runs again for url2 (which I can see by
> > > checking webPage.getBaseUrl() == url2),
> > >
> > > it again returns the HTML of both pages (url1 and url2).
> > >
> > > Why is it doing this?
> > > How can I get the HTML of only the URL for which the ParseFilter is
> > > currently running?
> > >
> > > Any help here would be appreciated!
> > >
> > > Thanks,
> > > Tony.
> > >
> >
>
>
>
> --
> *Lewis*
>
