Please take a look at WebTableReader [0], Tony, at around lines 408 - 420.
This works perfectly for dumps of my webdb in Cassandra and should work
well for you.
hth

[0]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?revision=1485846&view=markup
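The gist of what those lines do differently is worth spelling out: `ByteBuffer.array()` returns the *entire* backing array, ignoring the buffer's position and limit, so if Gora hands back a buffer that is a slice of a shared array you get everyone's bytes. Copying only the `remaining()` bytes fixes it. Here is a minimal, self-contained sketch (the `contentFor` helper and the sample data are hypothetical stand-ins for `webPage.getContent()`, not Nutch code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentSlice {

    // What the ParseFilter in the thread effectively does: array() returns the
    // whole backing array, ignoring the buffer's position and limit.
    static String wrongRead(ByteBuffer content) {
        return new String(content.array(), StandardCharsets.UTF_8);
    }

    // The safe way: copy only the bytes between position and limit.
    // duplicate() avoids disturbing the original buffer's position.
    static String correctRead(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        content.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for webPage.getContent(): two pages' html
        // sharing one backing array, each page's buffer a slice of it.
        byte[] backing = "<html>url1</html><html>url2</html>"
                .getBytes(StandardCharsets.UTF_8);
        ByteBuffer url2Content = ByteBuffer.wrap(backing, 17, 17);

        System.out.println("array():    " + wrongRead(url2Content));   // both pages
        System.out.println("remaining(): " + correctRead(url2Content)); // url2 only
    }
}
```

Whether the buffer actually shares a backing array depends on the Gora store in use, which would explain why the symptom shows up with some backends and not others.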


On Tue, Jun 18, 2013 at 10:50 AM, Tony Mullins <[email protected]> wrote:

> Guys, it's a serious issue. Could anyone please help me here and explain
> why it's doing so?
>
> I have just crawled www.google.nl and www.bing.com, and in my log file (I
> am logging the html in my ParseFilter) I can see the html of each url
> coming twice.
>
> How can one extract the required information in this case, when
> webPage.getContent().array() is returning the html of all urls in
> seed.txt?!
>
> Tony. :(
>
>
>
> On Tue, Jun 18, 2013 at 2:45 PM, Tony Mullins <[email protected]>
> wrote:
>
> > Hi,
> >
> > I have two urls in my seed.txt, url1 & url2. When Nutch runs my
> > ParseFilter plugin, I can see that I am in url1 by checking
> > webPage.getBaseUrl(). At that point, if I do
> > String html = new String(webPage.getContent().array());
> >
> > it returns me the html of both url1 & url2.
> >
> > And when my ParseFilter is run again for url2 (which I can verify by
> > checking webPage.getBaseUrl() == url2),
> >
> > it again returns me the html of both pages (url1 & url2)...
> >
> > Why is it doing so?
> > How can I get the html of only the url for which the ParseFilter is
> > currently running?
> >
> > Please, any help here!!!
> >
> > Thanks,
> > Tony.
> >
>



-- 
*Lewis*
