It actually seems to be (or to be becoming) legitimate to have meta tags in the
body, but allowing that would break existing tests. Should the tests be updated
if meta and link tags are going to be allowed in the body?

http://dev.w3.org/html5/md/#content-models
If the itemprop attribute is present on link or meta, they are flow content and 
phrasing content. The link and meta elements may be used where phrasing content 
is expected if the itemprop attribute is present.
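
To make the quoted rule concrete, here is a minimal sketch in Java of the check
it implies. The class and method names are made up for illustration and are
not part of Tika or TagSoup; the attribute map is assumed to use lower-case
keys.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of Tika or TagSoup. */
final class MicrodataContentModel {
    // The two elements the quoted rule talks about.
    private static final Set<String> META_OR_LINK = Set.of("meta", "link");

    private MicrodataContentModel() {}

    /**
     * Returns true if the element is a meta or link element carrying an
     * itemprop attribute, i.e. it counts as flow/phrasing content under the
     * quoted rule. Any other element is outside the rule and yields false.
     * Assumption: attribute names in the map are already lower-cased.
     */
    static boolean metaOrLinkAllowedInBody(String tagName, Map<String, String> attributes) {
        String name = tagName.toLowerCase(Locale.ROOT);
        return META_OR_LINK.contains(name) && attributes.containsKey("itemprop");
    }
}
```

With this, `<meta content="EUR" itemprop="priceCurrency">` would be accepted
in the body, while a plain `<meta name="author" ...>` would not.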

 
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Thu 30-Aug-2012 18:59
> To: [email protected]
> Subject: RE: Meta tag in body, what does Tika do to them?
> 
> Apparently meta tags get thrown out because they are mapped to the HEAD group 
> in html.tssl in TagSoup.  If I change the META element's group in the schema 
> to 255 (belongs to anything), the unit test passes but other HtmlParser tests 
> fail. Although it is bad practice, we still want to support these kinds of 
> microdata elements.
> 
> We could try to expose the schema so that external code can override 
> TagSoup's schema elementTypes. This would allow my unit test and the others 
> to pass.
> 
> Any advice on what to do next? 
> 
> <failure message="expected:&lt;Tika Developers&gt; but was:&lt;null&gt;" 
> type="junit.framework.ComparisonFailure">junit.framework.ComparisonFailure: 
> expected:&lt;Tika Developers&gt; but was:&lt;null&gt;
>         at junit.framework.Assert.assertEquals(Assert.java:85)
>         at junit.framework.Assert.assertEquals(Assert.java:91)
>         at org.apache.tika.parser.html.HtmlParserTest.testParseAscii(HtmlParserTest.java:81)
>  
> -----Original message-----
> > From:Markus Jelsma <[email protected]>
> > Sent: Tue 28-Aug-2012 14:48
> > To: [email protected]
> > Subject: Meta tag in body, what does Tika do to them?
> > 
> > Hi,
> > 
> > We're testing TIKA-980 (MicrodataContentHandler for Apache Tika) and a lot 
> > of URLs work out just fine if microdata is implemented properly.  But 
> > we're also seeing a lot of webmasters putting meta tags with microdata 
> > properties right in the body! They apparently read Google's webmaster page 
> > [1] about invisible microdata and went along adding meta tags to the body 
> > as if it's normal practice.
> > 
> > Whenever the webmaster has for example:
> > 
> >                 <meta content="EUR" itemprop="priceCurrency">
> >                 <span itemprop="price">17.50</span>
> > 
> > ...the MicrodataContentHandler trips over it and cannot assign price to an 
> > itemscope because the DOM seems to become reordered/normalized, even when 
> > I (in a test) properly close the meta tag. What does Tika do to meta tags 
> > in the content when using the IdentityHtmlMapper? How can we read the meta 
> > tag as if it's just another tag? Is there some switch or setting I've 
> > missed?
> > 
> > [1]: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=146750
> > 
> > Thanks,
> > Markus
> > 
> 
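
For readers following the schema discussion in the quoted message, here is a
rough, self-contained sketch of how group-based containment can work. It
assumes that a parent may contain a child when the parent's content model
shares at least one bit with the groups the child is a member of. The class,
the bit values and the element definitions are all hypothetical; they are
not TagSoup's real constants or API.

```java
/** Toy model of bitmask-based element groups; not TagSoup code. */
final class GroupMaskSketch {
    // Hypothetical group bits, chosen only for this illustration.
    static final int G_HEAD = 1 << 0;
    static final int G_BLOCK = 1 << 1;
    static final int G_INLINE = 1 << 2;
    static final int G_ANY = 255; // all low eight bits set: "belongs to anything"

    static final class ElementType {
        final String name;
        final int model;    // groups this element may contain
        final int memberOf; // groups this element belongs to

        ElementType(String name, int model, int memberOf) {
            this.name = name;
            this.model = model;
            this.memberOf = memberOf;
        }

        /** True if the parent's model and the child's groups overlap. */
        boolean canContain(ElementType child) {
            return (model & child.memberOf) != 0;
        }
    }

    private GroupMaskSketch() {}

    public static void main(String[] args) {
        ElementType head = new ElementType("head", G_HEAD, 0);
        ElementType div = new ElementType("div", G_BLOCK | G_INLINE, G_BLOCK);
        ElementType metaHeadOnly = new ElementType("meta", 0, G_HEAD);
        ElementType metaAnywhere = new ElementType("meta", 0, G_ANY);

        System.out.println(div.canContain(metaHeadOnly));  // false
        System.out.println(div.canContain(metaAnywhere));  // true
        System.out.println(head.canContain(metaAnywhere)); // true
    }
}
```

In this toy model a meta element that is only a member of the head group
cannot stay inside a div, whereas one with membership 255 is accepted by both
head and div.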
