Please check out https://issues.apache.org/jira/browse/NUTCH-978 and give feedback on how this could work for you.
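
For the tag-level part of the question (keep only text inside chosen tags, add a full stop after <br>), here is a rough sketch of the idea in plain Python. It is not Nutch code and is not taken from NUTCH-978; it uses only the standard library html.parser module. The names KEEP_TAGS, STOP_AFTER_TAGS, MainTextExtractor and extract_main_text are made up for illustration, and the sketch assumes every kept tag is properly closed. The same logic would have to be ported to Java inside your own plugin.

```python
from html.parser import HTMLParser

# Illustrative assumptions: which tags to keep text from,
# and which tags should be followed by a full stop.
KEEP_TAGS = {"p", "table"}
STOP_AFTER_TAGS = {"br"}


class MainTextExtractor(HTMLParser):
    """Collect text found inside KEEP_TAGS; add '.' at STOP_AFTER_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of kept tags currently open
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_TAGS:
            self.depth += 1
        elif tag in STOP_AFTER_TAGS and self.depth > 0:
            self._full_stop()

    def handle_endtag(self, tag):
        if tag in KEEP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace; ignore text outside kept tags.
        text = " ".join(data.split())
        if self.depth > 0 and text:
            self.parts.append(text)

    def _full_stop(self):
        # Do not double the punctuation if the text already ends a sentence.
        if self.parts and not self.parts[-1].endswith((".", "!", "?")):
            self.parts[-1] += "."


def extract_main_text(html):
    """Return the text inside KEEP_TAGS as one space-joined string."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling extract_main_text on a page fragment drops text that sits outside <p> and <table> (navigation links and the like) and closes the sentence in front of each <br>. Real pages often leave <p> unclosed, so a production version would need a more forgiving parser.
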
Thank you

hth

Lewis

On Thu, Jul 5, 2012 at 6:19 PM, Sandeep C R <[email protected]> wrote:
> Hello,
>
> I am trying to get contents from this webpage. (Just an example. Similarly
> I am crawling many other webpages)
> http://www.osc.edu/supercomputing/software/apps/amber.shtml
>
> Later I do some customized indexing by sending contents of this webpage by
> writing a plugin which implements IndexingFilter. However, there are many
> unnecessary things in contents like text from other links, i.e. Contact us,
> Support, Visit time etc. I am just interested in the main contents of this
> webpage. Is there any way to get just these contents?
>
> Also, is there a way to parse only specified tags like <p><table> etc.? And
> I also want to insert a full stop after every <br> and a few other tags. All
> these are requirements for the customized indexer I am using. Is it possible
> to achieve something like this? Thank you.
>
> Regards,
> Sandeep

--
Lewis

