Will do. I think it would be very worthwhile to add the nu parser as a configurable option, as it is both tolerant of errors (if you want it to be) and SAX compatible. The existing parser also seems good, but it is designed more to dissect the HTML than to raise events in the order that SAX would expect. That works for a lot of cases but not for mine, so I use the same content handler as for Tika and just use:

<dependency>
  <groupId>nu.validator</groupId>
  <artifactId>htmlparser</artifactId>
  <version>1.4.6</version>
</dependency>

for the actual parse. I will try to find a document that shows the difference, but the Tika HTML parser definitely raises events out of order. I did post about it at the time but doubt that I included a document. I will try to do so.

Jim

> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Tuesday, November 28, 2017 21:03
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
>
> > As the HTML parser in Tika does not produce SAX events in the correct
> > order - the parser is great but does not support serialization - etc.
>
> Oh, please open a ticket with examples, or point me to one I've forgotten
> about... ☹ Thank you!
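
P.S. For a ticket, one way to show the difference would be to record the SAX events each parser emits for the same document and compare the two lists. The following is only a minimal sketch under stated assumptions: the class name EventRecorder and its record method are hypothetical, and the handler is driven here by the JDK's own SAX parser on a small well-formed snippet purely so that it runs without extra dependencies. The same handler could presumably be passed to the nu.validator parser or to Tika for a side-by-side comparison.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records SAX events in the order they arrive. */
class EventRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Ignore whitespace-only chunks so the log stays readable.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);
        }
    }

    List<String> events() {
        return events;
    }

    /** Parses well-formed markup with the JDK SAX parser and returns the event log. */
    static List<String> record(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EventRecorder recorder = new EventRecorder();
        reader.setContentHandler(recorder);
        reader.parse(new InputSource(new StringReader(xml)));
        return recorder.events();
    }

    public static void main(String[] args) throws Exception {
        // Print one event per line so two runs can be diffed.
        for (String event : record("<html><body><p>Hi</p></body></html>")) {
            System.out.println(event);
        }
    }
}
```

Running the same recorder against both parsers and diffing the output should make any out-of-order events easy to spot in a bug report.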