Will do. I think it would be very worthwhile to add the nu parser as a 
configurable option, since it is both tolerant of errors (if you want it to 
be) and SAX compatible. The existing parser also seems good, but it is designed 
more to dissect the HTML than to raise events in the order that SAX would 
expect. That works for a lot of cases but not for mine, so I use the same 
content handler as for Tika and just use:

        <dependency>
            <groupId>nu.validator</groupId>
            <artifactId>htmlparser</artifactId>
            <version>1.4.6</version>
        </dependency>

for the actual parse. I will try to find a document that shows the difference, 
but the existing parser definitely raises events out of order. I did post about 
it at the time but doubt that I included a document; I will try to do so.
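
In the meantime, here is a rough sketch of the kind of check I have in mind for 
"out of order": a content handler that records a violation whenever an 
endElement does not match the most recently opened element, or when events 
arrive outside startDocument/endDocument. The class name SaxOrderCheck and its 
helper methods are made up for illustration and are not part of Tika or the nu 
parser. The demo uses the JDK's own SAX parser on well-formed markup; I am 
assuming any other XMLReader (for example the nu parser's 
nu.validator.htmlparser.sax.HtmlParser) could be passed to check() instead.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records SAX events that arrive in an unexpected order.
class SaxOrderCheck extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>(); // currently open elements
    private boolean started = false;
    private boolean ended = false;
    final List<String> violations = new ArrayList<>();

    @Override
    public void startDocument() {
        if (started) violations.add("startDocument raised twice");
        started = true;
    }

    @Override
    public void endDocument() {
        if (!open.isEmpty()) violations.add("endDocument with open elements: " + open);
        ended = true;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!started || ended) violations.add("startElement outside document: " + qName);
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element being closed must be the one opened most recently.
        String expected = open.poll();
        if (!qName.equals(expected)) {
            violations.add("endElement " + qName + " but open element was " + expected);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!started || ended) violations.add("characters outside document");
    }

    // Parse the markup with the given reader and return any ordering violations.
    static List<String> check(XMLReader reader, String markup) {
        SaxOrderCheck handler = new SaxOrderCheck();
        try {
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.violations;
    }

    // The JDK's built-in SAX parser, used here only so the sketch runs standalone.
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><p>hi</p></body></html>";
        System.out.println(check(jdkReader(), markup));
    }
}
```

An empty list means the events came in the order a serializer would expect. The 
JDK parser only accepts well-formed input, so for real-world HTML the reader 
would need to be swapped for an error-tolerant one.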

Jim


> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Tuesday, November 28, 2017 21:03
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
> 
> 
> 
> >As the HTML parser in Tika does not produce SAX events in the correct
> order - the parser is great but does not support serialization - etc.
> 
> Oh, please open a ticket with examples, or point me to one I've forgotten
> about... ☹  Thank you!
