Apparently meta tags get thrown out because they are mapped to the HEAD group
in html.tssl in TagSoup. If I replace the META element in the schema with a
group of 255 (belongs to anything), the unit test passes but other HtmlParser
tests fail. Although it is bad practice, we still want to support these kinds
of microdata elements.
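
To illustrate the group issue, here is a minimal, self-contained sketch of how a bitmask-based schema decides whether one element may contain another. It is a simplified model, not TagSoup's code: the class name, the constants and their bit values are all made up for illustration, and I am assuming the containment check is a bitwise AND of the parent's content model with the child's group membership.

```java
// Simplified model of a schema in which every element has a content model
// (the groups it may contain) and a membership mask (the groups it belongs to).
// All names and bit values are hypothetical, not TagSoup's real constants.
public class GroupModelSketch {
    static final int M_HEAD   = 1 << 0; // hypothetical: head-only content
    static final int M_INLINE = 1 << 1; // hypothetical: inline content
    static final int M_BLOCK  = 1 << 2; // hypothetical: block content
    static final int M_ANY    = 0xFF;   // 255: member of every group in this model

    static final class ElementType {
        final String name;
        final int model;    // groups this element accepts as children
        final int memberOf; // groups this element belongs to

        ElementType(String name, int model, int memberOf) {
            this.name = name;
            this.model = model;
            this.memberOf = memberOf;
        }

        // A parent can contain a child if the child belongs to at least
        // one group that the parent's content model accepts.
        boolean canContain(ElementType child) {
            return (model & child.memberOf) != 0;
        }
    }

    public static void main(String[] args) {
        ElementType head = new ElementType("head", M_HEAD, 0);
        ElementType div = new ElementType("div", M_BLOCK | M_INLINE, M_BLOCK);
        ElementType metaDefault = new ElementType("meta", 0, M_HEAD); // HEAD group only
        ElementType metaRelaxed = new ElementType("meta", 0, M_ANY);  // group of 255

        System.out.println(head.canContain(metaDefault)); // true
        System.out.println(div.canContain(metaDefault));  // false
        System.out.println(div.canContain(metaRelaxed));  // true
        System.out.println(head.canContain(metaRelaxed)); // true
    }
}
```

In this model the relaxed meta is accepted by every parent, not just by body-level elements, which may be why the change affects documents that the other HtmlParser tests cover.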
We could try to expose the schema so that external code can override the
element types in TagSoup's schema. This would allow my unit test and others
to pass. Any advice on what to do next?
<failure message="expected:<Tika Developers> but was:<null>"
type="junit.framework.ComparisonFailure">junit.framework.ComparisonFailure:
expected:<Tika Developers> but was:<null>
at junit.framework.Assert.assertEquals(Assert.java:85)
at junit.framework.Assert.assertEquals(Assert.java:91)
at org.apache.tika.parser.html.HtmlParserTest.testParseAscii(HtmlParserTest.java:81)
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Tue 28-Aug-2012 14:48
> To: [email protected]
> Subject: Meta tag in body, what does Tika do to them?
>
> Hi,
>
> We're testing TIKA-980 (MicrodataContentHandler for Apache Tika) and a lot of
> URLs work out just fine if microdata is implemented properly. But we're
> also seeing a lot of webmasters putting meta tags with microdata properties
> right in the body! They apparently read Google's webmaster page [1] about
> invisible microdata and went along adding meta tags to the body as if it's
> normal practice.
>
> Whenever the webmaster has for example:
>
> <meta content="EUR" itemprop="priceCurrency">
> <span itemprop="price">17.50</span>
>
> ...the MicrodataContentHandler trips over it and cannot assign price to an
> itemscope because the DOM seems to become reordered/normalized, even when I
> (in a test) properly close the meta tag. What does Tika do to meta tags in
> the content when using the IdentityHtmlMapper? How can we read the meta tag
> as if it's just another tag? Is there some switch or setting I've missed?
>
> [1]: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=146750
>
> Thanks,
> Markus
>