Re: HTML parsing

susan_levine 13 Mar 2000 20:40:44 -0000


Can anyone send us links to the Sun and IBM versions, please?  Are these
Java or C++ implementations?  I'm using the Java Tidy HTML parser because
the OpenXML one is far too unforgiving of HTML that isn't well-formed.  But
I'd prefer to use something else (also Java) because Tidy isn't built for
speed.


Thanks in advance,

--Susan





                                                                                
                                   
Mike Pogue <[EMAIL PROTECTED]e.org>
03/13/00 11:41 AM
Please respond to xerces-dev

To:      [EMAIL PROTECTED]
cc:
Subject: Re: HTML parsing





Note that we have a couple of people who would like to donate an
HTML parser to xml.apache.org, to be added to Xerces.  The ones I know of
are:

     ExOffice (extremely well tested, used for web spiders),
     Sun (I haven't seen it yet), and
     IBM (I haven't seen it yet either).

I suspect that if people are interested in this, we ought to have people
look at all three, and figure out whether one is better, or whether they
should be merged somehow before being checked in...assuming there's
interest in this!

Any volunteers?

Mike

Cox Andy wrote:
>
> If the HTML is not well-formed XML (which most is not), you are correct.
>
> Andy
>
> | -----Original Message-----
> | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> | Sent: Monday, March 13, 2000 10:32 AM
> | To: [EMAIL PROTECTED]
> | Subject: HTML parsing
> |
> |
> | From what I can tell, I cannot expect to be able to parse an HTML doc
> | with the xerces parser?  I was hoping to use the C++ SAX parser to find
> | <IMG> tags but I don't think I will be able to do that.  Can someone
> | confirm this dreadful fact?
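
To make the point in this thread concrete, here is a minimal sketch in Java of the two behaviours being discussed: a strict SAX parse that rejects HTML which is not well-formed XML, and a lenient HTML parse that still finds the <IMG> tags. It is an illustration only, under stated assumptions. It uses the JAXP SAX API and the Swing HTML parser that ship with the JDK, not Xerces-C++, Tidy, OpenXML, or any of the parsers offered for donation above. The class name ImgTagDemo, the method names, and the sample markup are all made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; none of these names come from the thread.
class ImgTagDemo {

    // Strict route: collect the src attribute of every img element with a
    // SAX parser. A SAXException is thrown if the markup is not well-formed.
    static List<String> imgSourcesViaSax(String markup) throws Exception {
        final List<String> sources = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("img".equalsIgnoreCase(qName)) {
                    // XML attribute names are case-sensitive: this only
                    // matches a lowercase "src" attribute.
                    sources.add(attrs.getValue("src"));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return sources;
    }

    // Lenient route: the JDK's Swing HTML parser tolerates unclosed tags,
    // unquoted attribute values, and mixed-case names.
    static List<String> imgSourcesViaSwing(String markup) throws IOException {
        final List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    sources.add(src == null ? null : src.toString());
                }
            }
        };
        // Third argument: ignore any charset declared inside the document.
        new ParserDelegator().parse(new StringReader(markup), callback, true);
        return sources;
    }

    public static void main(String[] args) throws Exception {
        // Typical "real world" HTML: unclosed tags and an unquoted attribute.
        String messy = "<html><body><p>Logo: <IMG SRC=\"logo.gif\"><br>"
                + "<img src=banner.gif></body></html>";
        try {
            imgSourcesViaSax(messy);
        } catch (SAXException e) {
            System.out.println("SAX parser rejected the document: " + e.getMessage());
        }
        System.out.println("Lenient parser found: " + imgSourcesViaSwing(messy));
    }
}
```

The design choice here mirrors the trade-off raised in the thread: the SAX route only works when the input is already well-formed (XHTML-style markup), while a forgiving HTML parser accepts the markup found on most pages, usually at some cost in speed.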
