Hi

I checked the ParseFilter interface in Nutch 2.x. Its filter method is declared like this:

Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags,
    DocumentFragment doc);

Inside this method you can get the raw content of the HTML page like this:

String content = new String(page.getContent().array());

and get the parsed text through the parse.getText() method.
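
One caveat, in case it helps: the one-liner above reads the whole backing array and uses the platform default charset. Below is a minimal sketch of a more defensive decode, assuming page.getContent() returns a java.nio.ByteBuffer (as the .array() call suggests) and that the page is UTF-8 encoded. The class name RawContentDecoder and the UTF-8 choice are my own assumptions, not part of the Nutch API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: turns a content buffer into a String.
public class RawContentDecoder {

    /**
     * Decodes only the readable bytes of the buffer (position up to limit)
     * and leaves the caller's buffer untouched.
     */
    public static String decode(ByteBuffer content, Charset charset) {
        if (content == null) {
            return "";
        }
        // duplicate() shares the bytes but has its own position and limit,
        // so decoding does not advance the original buffer.
        return charset.decode(content.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] backing = "XX<html><body>hello</body></html>YY"
                .getBytes(StandardCharsets.UTF_8);
        // A buffer that exposes only the middle part of its backing array;
        // new String(buf.array()) would also include the XX and YY bytes.
        ByteBuffer buf = ByteBuffer.wrap(backing, 2, backing.length - 4);
        System.out.println(decode(buf, StandardCharsets.UTF_8));
    }
}
```

In the filter method this would be called as RawContentDecoder.decode(page.getContent(), StandardCharsets.UTF_8), and the resulting string could then be handed to JSoup, for example with Jsoup.parse(html, url). If the page declares a different encoding, pass that charset instead.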





On Thu, Jun 13, 2013 at 11:10 PM, Jamshaid Ashraf <[email protected]> wrote:

> Hi,
>
> I'm using the Nutch 2.2 ParseFilter plugin and I need to extract custom
> information from the raw HTML (preferably using JSoup), but I still
> couldn't find out how to get the raw HTML in the @Override filter() method,
> as all the examples I have found use the Nutch 1.x API and don't work with
> the new Nutch 2.x API.
>
>
> Thanks in advance!
>
> Regards,
> Jamshaid
>



-- 
Don't Grow Old, Grow Up... :-)
