Parsing html

2010-05-04 Thread nachonieto3

Good afternoon,

Having solved my problem with the other formats, I'm now trying to figure
out how to solve another one.
I'm able to parse .html files, but I get the ParseText on a single line. I would
like to preserve at least the paragraphs of the original document. Does anyone
know how to do it?
Thank you in advance.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Parsing-html-tp776487p776487.html
Sent from the Nutch - User mailing list archive at Nabble.com.
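
One possible way to keep paragraph breaks, as a minimal sketch only: it does
not use Nutch's own HTML parser or its ParseText class. It assumes the JDK's
built-in Swing HTML parser (javax.swing.text.html) is acceptable, and the
class name ParagraphTextSketch and the sample HTML are hypothetical. The idea
is to emit a line break whenever a block-level element closes, instead of
joining all text nodes into one line.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Nutch.
class ParagraphTextSketch {

    // Returns the text of the document with one line break per block element.
    static String extract(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Append a newline unless the output is empty or already ends with one.
            private void breakLine() {
                int n = out.length();
                if (n > 0 && out.charAt(n - 1) != '\n') {
                    out.append('\n');
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Block-level tags (p, div, h1, ...) end a paragraph.
                if (tag.isBlock()) {
                    breakLine();
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // An explicit <br> also forces a line break.
                if (tag == HTML.Tag.BR) {
                    breakLine();
                }
            }
        };
        try {
            // "true" tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>First paragraph.</p>"
                + "<p>Second paragraph.</p></body></html>";
        System.out.println(extract(html));
    }
}
```

Inside Nutch itself, the same idea would have to be applied wherever the
parse plugin collects text from the DOM; where exactly depends on the Nutch
version in use.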


Re: java.nio.BufferOverflowException while parsing html contents

2005-12-21 Thread Jérôme Charron
Could you please provide more information:
* Stack trace
* Attach the SecurityFacade.html file (in order to reproduce)
* Some information about the context: config, Nutch version, ...

Regards

Jérôme


On 12/21/05, Arun Kumar Sharma <[EMAIL PROTECTED]> wrote:
>
> Hi,
>   I am getting a java.nio.BufferOverflowException error while parsing HTML
> content. Can you suggest a way out?
>
> Parsing
> file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html]
> with [EMAIL PROTECTED]
> java.nio.BufferOverflowException
>
>
>  Regards,
>
> Arun Kumar Sharma (Tech Lead -Java/J2EE)
> Mob: +91.981.529.5761
>



--
http://motrech.free.fr/
http://www.frutch.org/


java.nio.BufferOverflowException while parsing html contents

2005-12-20 Thread Arun Kumar Sharma
Hi,

I am getting a java.nio.BufferOverflowException error while parsing HTML
content. Can you suggest a way out?

Parsing
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html]
with [EMAIL PROTECTED]
java.nio.BufferOverflowException

Regards,

Arun Kumar Sharma (Tech Lead -Java/J2EE)
Mob: +91.981.529.5761

RE: Parsing HTML meta tags

2005-09-30 Thread Paul Williams
For the record -

Override the implementations of BasicIndexingFilter and HtmlParser so that
the metadata tags are returned.  You can then parse for Dublin Core content
and discard the rest.
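
The "keep Dublin Core, discard the rest" step can be illustrated with a
minimal, standalone sketch. This is not the BasicIndexingFilter or HtmlParser
override itself: it assumes the JDK's Swing HTML parser, the class name
DublinCoreMetaSketch is hypothetical, the sample page is made up, and the
"DC." name prefix is an assumed convention for Dublin Core meta tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Nutch.
class DublinCoreMetaSketch {

    // Collects <meta name="DC.*" content="..."> pairs and ignores other meta tags.
    static Map<String, String> extract(String html) {
        final Map<String, String> dc = new LinkedHashMap<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name == null || content == null) {
                    return;
                }
                // Assumption: Dublin Core tags are named "DC.something".
                if (name.toString().toLowerCase(Locale.ROOT).startsWith("dc.")) {
                    dc.put(name.toString(), content.toString());
                }
            }
        };
        try {
            // "true" tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dc;
    }

    public static void main(String[] args) {
        // Made-up sample page for illustration only.
        String html = "<html><head><title>Example</title>"
                + "<meta name=\"DC.title\" content=\"Example page\">"
                + "<meta name=\"keywords\" content=\"ignored\">"
                + "</head><body></body></html>";
        System.out.println(extract(html));
    }
}
```

In a Nutch setup, the collected pairs would still need to be passed from the
parser to the indexing filter and added to the indexed document; that wiring
is not shown here.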

-Original Message-
From: Paul Williams 
Sent: 28 September 2005 10:51
To: nutch-user@lucene.apache.org
Subject: Parsing HTML meta tags

Hi,

 

I'm trying to parse an external site that contains meta tags encoded in
the HTML, such as:

 

BBC - GCSE Bitesize - Homepage


 
Nutch is able to see the pages but I'm not getting any of the meta tags
indexed.  Is there a way to do this?
 
Thanks,
Paul.

 






Parsing HTML meta tags

2005-09-28 Thread Paul Williams
Hi,

 

I'm trying to parse an external site that contains meta tags encoded in
the HTML, such as:

 

BBC - GCSE Bitesize - Homepage


 
Nutch is able to see the pages but I'm not getting any of the meta tags
indexed.  Is there a way to do this?
 
Thanks,
Paul.