Parsing html
Good afternoon,

I have solved my problem with the other formats, and now I'm trying to solve another one. I'm able to parse .html files, but I get the ParseText as a single line. I would like to preserve at least the paragraphs of the original document. Does anyone know how to do this?

Thank you in advance.
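One possible way to keep paragraph boundaries, sketched here in plain Java rather than with Nutch's own classes, is to walk the parsed DOM tree and emit a line break whenever a block-level element closes. The class name ParagraphText, the BLOCK_TAGS set, and the use of the JDK XML parser on well-formed XHTML are all assumptions made for illustration; real-world HTML would need a tolerant HTML parser in front of the same walk.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: extracts text and keeps paragraph breaks.
final class ParagraphText {

    // Assumed set of elements that should end a line in the extracted text.
    private static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"));

    // Parses well-formed XHTML and returns its text, one paragraph per line.
    static String extract(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), out);
        // Drop spaces around line breaks and collapse runs of empty lines.
        return out.toString().replaceAll(" *\\n[ \\n]*", "\n").trim();
    }

    // Depth-first walk: append text nodes, add a newline after each block element.
    private static void walk(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue().replaceAll("\\s+", " "));
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && BLOCK_TAGS.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
            out.append('\n');
        }
    }
}
```

Calling ParagraphText.extract on a page with two p elements returns two lines of text instead of one. Inline elements such as b or a are not in BLOCK_TAGS, so their text stays on the same line as the surrounding paragraph.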
Re: java.nio.BufferOverflowException while parsing html contents
Could you please provide more information:
* Stack trace
* The SecurityFacade.html file as an attachment (in order to reproduce)
* Some information about the context: config, Nutch version, ...

Regards,
Jérôme

On 12/21/05, Arun Kumar Sharma <[EMAIL PROTECTED]> wrote:
> Hi,
> I am getting a java.nio.BufferOverflowException error while parsing HTML content. Can you suggest a way out?
>
> Parsing file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html] with [EMAIL PROTECTED]
> java.nio.BufferOverflowException
>
> Regards,
> Arun Kumar Sharma (Tech Lead -Java/J2EE)
> Mob: +91.981.529.5761

--
http://motrech.free.fr/
http://www.frutch.org/
java.nio.BufferOverflowException while parsing html contents
Hi,

I am getting a java.nio.BufferOverflowException error while parsing HTML content. Can you suggest a way out?

Parsing file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html] with [EMAIL PROTECTED]
java.nio.BufferOverflowException

Regards,
Arun Kumar Sharma (Tech Lead -Java/J2EE)
Mob: +91.981.529.5761
RE: Parsing HTML meta tags
For the record:

Override the implementations of BasicIndexingFilter and HtmlParser so that the metadata tags are returned. You can then parse for Dublin Core content and discard the rest.

-----Original Message-----
From: Paul Williams
Sent: 28 September 2005 10:51
To: nutch-user@lucene.apache.org
Subject: Parsing HTML meta tags

Hi,

I'm trying to parse an external site that contains meta tags encoded in the HTML, such as:

BBC - GCSE Bitesize - Homepage

Nutch is able to see the pages but I'm not getting any of the meta tags indexed. Is there a way to do this?

Thanks,
Paul.
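The "keep Dublin Core, discard the rest" step mentioned in the reply above could look roughly like the following. This is a minimal sketch in plain Java that does not use the Nutch BasicIndexingFilter or HtmlParser APIs; the class name DublinCoreMeta is hypothetical, the input is assumed to be well-formed XHTML, and it assumes the common convention that Dublin Core meta tags carry a name beginning with "DC.".

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects Dublin Core meta tags only.
final class DublinCoreMeta {

    // Returns name -> content for every meta element whose name starts with "DC.".
    static Map<String, String> extract(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        Map<String, String> dublinCore = new LinkedHashMap<>();
        // XML DOM lookups are case-sensitive, so only lowercase <meta> is matched here.
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            String name = meta.getAttribute("name"); // empty string when absent
            if (name.toLowerCase(Locale.ROOT).startsWith("dc.")) {
                dublinCore.put(name, meta.getAttribute("content"));
            }
        }
        return dublinCore;
    }
}
```

Given a head section with one meta tag named DC.title and one named keywords, the returned map contains only the DC.title entry. In an actual Nutch setup the resulting pairs would still have to be handed to the indexing side, which this sketch leaves out.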
Parsing HTML meta tags
Hi,

I'm trying to parse an external site that contains meta tags encoded in the HTML, such as:

BBC - GCSE Bitesize - Homepage

Nutch is able to see the pages but I'm not getting any of the meta tags indexed. Is there a way to do this?

Thanks,
Paul.