Please take a look at https://issues.apache.org/jira/browse/NUTCH-978 and
let us know whether it could work for you.
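
Separately from that issue, here is a minimal standalone sketch of the tag
filtering described in the quoted message below: keep only the text inside
chosen tags and put a full stop after every <br>. It is plain Python using
only the standard library, not Nutch code and not what NUTCH-978 implements.
The names MainTextExtractor and extract_main_text are hypothetical, and it
assumes every kept tag (e.g. <p>) has a matching closing tag.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text found inside selected tags only (illustrative sketch)."""

    def __init__(self, keep_tags=("p", "table"), stop_tags=("br",)):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)   # tags whose text we keep
        self.stop_tags = set(stop_tags)   # tags that get a full stop appended
        self.depth = 0                    # nesting depth inside kept tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            self.depth += 1
        elif tag in self.stop_tags and self.depth > 0:
            self.parts.append(". ")

    def handle_endtag(self, tag):
        if tag in self.keep_tags and self.depth > 0:
            self.depth -= 1
            self.parts.append(" ")  # keep adjacent blocks from running together

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self.parts).split())


def extract_main_text(html, keep_tags=("p", "table"), stop_tags=("br",)):
    parser = MainTextExtractor(keep_tags, stop_tags)
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        '<div><a href="/contact">Contact us</a>'
        "<p>AMBER is a package<br>Version 11</p>"
        "<span>Support</span></div>"
    )
    print(extract_main_text(sample))
```

Text from the navigation link and the <span> is dropped because it sits
outside the kept tags. Real pages often leave <p> unclosed, so a production
version would need a more forgiving parser.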

Thank you, hth

Lewis

On Thu, Jul 5, 2012 at 6:19 PM, Sandeep C R <[email protected]> wrote:
> Hello,
>
> I am trying to get the contents of this webpage. (It is just an example; I
> am crawling many other webpages in the same way.)
> http://www.osc.edu/supercomputing/software/apps/amber.shtml
>
> Later I do some customized indexing by passing the contents of this webpage
> to a plugin I wrote that implements IndexingFilter. However, the contents
> include many unnecessary things, such as text from other links, e.g.
> Contact us, Support, Visit time etc. I am only interested in the main
> contents of this webpage. Is there any way to get just these contents?
>
> Also, is there a way to parse only specified tags such as <p>, <table> etc.?
> I also want to insert a full stop after every <br> and a few other tags. All
> of these are requirements for the customized indexer I am using. Is it
> possible to achieve something like this? Thank you.
>
> Regards,
> Sandeep



-- 
Lewis
