Wait, TagSoup is not returning the start element events in the same order as 
the HTML?  I don’t know that we can fix that or your other points, but would 
you be willing to share triggering documents and open an issue for each problem?

We should include those issues in our ongoing conversation about swapping out 
the underlying html parser for something more modern.

Sorry Tika isn’t working for you on this, and thank you!

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Friday, June 30, 2017 1:23 AM
To: user@tika.apache.org
Subject: RE: HTML parsing, script tags,

Well, I got a long way with the Tika wrapper around TagSoup, but then, while 
chasing down a bug, I realized that I was not getting the startElement events in 
the order that they appear in the HTML file. It also ignores <!doctype> and 
unknown elements.
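
One way to pin down an ordering problem like this is to record the element 
names as the startElement events arrive and compare them with the source. The 
sketch below is only an illustration: it uses the JDK's built-in SAX parser on 
well-formed input as a stand-in, and the class name StartElementOrder is made 
up. The handler is a plain SAX ContentHandler, so the assumption is that the 
same handler could be passed to the Tika HTML parser to capture the order it 
reports for a triggering document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records element names in the order startElement fires.
public class StartElementOrder extends DefaultHandler {
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace-aware, so qName carries the name.
        names.add(qName);
    }

    public List<String> names() {
        return names;
    }

    // Parses well-formed markup with the JDK SAX parser and returns the recorded order.
    public static List<String> record(String xml) throws Exception {
        StartElementOrder handler = new StartElementOrder();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><title>t</title></head><body><p>x</p></body></html>";
        System.out.println(record(doc)); // prints [html, head, title, body, p]
    }
}
```

A recorded list that differs from the document order for a small input would 
make a compact reproduction to attach to an issue.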

I can’t see any way to change that, and since knowing the structure of the 
document is very important, I guess I will have to stop using Tika for HTML and 
go back to validator.nu.

Just posting this here for posterity really.

Jim

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Wednesday, June 28, 2017 23:06
To: user@tika.apache.org
Subject: Re: HTML parsing, script tags,

Hi Jim,

On Jun 28, 2017, at 12:07am, Jim Idle <ji...@proofpoint.com> wrote:

So right now it looks like the HTML parser only sends through script tags if 
they have a src attribute. Is this likely to change, or should I use another 
parser for HTML? I could submit a patch for this, of course.

You can use a custom mapper if you want to alter which tags get passed through.

E.g. check out IdentityHtmlMapper in Tika for a mapper that passes through 
everything.
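
For reference, a minimal sketch of wiring IdentityHtmlMapper into a parse, 
assuming Tika 1.x with tika-parsers on the classpath. The class name 
ScriptPassThrough and the sample HTML are made up for illustration. Whether 
inline script content shows up in the output may still depend on the Tika 
version, so treat this as a starting point rather than a confirmed fix.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical example class; only the Tika types are real.
public class ScriptPassThrough {

    // Parses the HTML with the identity mapper registered and returns the XHTML output.
    public static String parse(String html) throws Exception {
        ParseContext context = new ParseContext();
        // Replace the default mapper with one that passes every element through.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><script>var x = 1;</script><p>hi</p></body></html>";
        System.out.println(parse(html));
    }
}
```

To pass through only some tags, the same ParseContext slot can presumably take 
a custom HtmlMapper implementation instead of the identity one.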


Also, does anyone have an opinion on whether the underlying TagSoup code is 
tolerant of HTML in a similar manner to browsers (which will try to render 
anything) or expects well-formed HTML? I can go look at TagSoup directly, of 
course, but just wondered if anyone has experience of using Tika to parse HTML.

TagSoup (and JSoup and NekoHTML) are all Java libraries that try to fix up 
broken HTML, with varying degrees of success, depending on the way that HTML is 
broken.

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

