The problem is in TagSoup's schema, which is missing some HTML5 elements. As a 
workaround I added those elements to the schema in a (newly added) 
constructor of Tika's HtmlParser:

    public HtmlParser() {
        super();
        
        // Add some HTML5 elements
        HTML_SCHEMA.elementType("section", HTMLSchema.M_ANY, 255, 0);
        HTML_SCHEMA.elementType("article", HTMLSchema.M_ANY, 255, 0);
        HTML_SCHEMA.elementType("time", HTMLSchema.M_ANY, 255, 0);
    }

I used 255 as the memberOf value because the group constants are not defined in 
the schema and I couldn't find their integer representation in the html.tssl 
file in TagSoup. This is not a very elegant solution, so how should it be 
solved properly? Having these elements returned is very important for the 
MicrodataContentHandler: many websites that implement microdata put it on HTML5 
elements, so the underlying parser must not throw them away.
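
To show why 255 behaves as a catch-all, here is a minimal sketch. It assumes 
that memberOf is a bitmask and that a parent may contain a child when the 
parent's content model shares at least one bit with the child's memberOf. The 
class name, group names and bit positions below are made up for illustration; 
they are not TagSoup's actual constants.

```java
public class MemberOfSketch {
    // Hypothetical group bits, for illustration only. These are NOT
    // TagSoup's real values.
    static final int G_INLINE = 1 << 0;
    static final int G_BLOCK  = 1 << 1;
    static final int G_BODY   = 1 << 2;

    // Assumed containment rule: the parent's content model must share at
    // least one bit with the child's memberOf mask.
    static boolean canContain(int parentModel, int childMemberOf) {
        return (parentModel & childMemberOf) != 0;
    }

    public static void main(String[] args) {
        int catchAll = 255;      // 0xFF: member of the eight lowest groups
        int blockOnly = G_BLOCK; // a narrower, more deliberate choice

        // The eight low bits are all set.
        System.out.println(Integer.toBinaryString(catchAll));      // 11111111

        // A catch-all child fits under any parent whose model uses a low bit.
        System.out.println(canContain(G_BODY, catchAll));          // true
        System.out.println(canContain(G_INLINE, catchAll));        // true

        // A block-only child is rejected by an inline-only parent.
        System.out.println(canContain(G_INLINE, blockOnly));       // false
    }
}
```

If that assumption holds, 255 lets <section>, <article> and <time> appear 
almost anywhere, which is why it works but is looser than a proper group 
assignment would be.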

Thanks,
Markus 
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wed 29-Aug-2012 14:35
> To: [email protected]
> Subject: RE: Article and section tags
> 
> I checked TagSoup's properties [1] and tried disabling the ignoreBogonsFeature 
> that was introduced with TIKA-599 [2]. My unit test using the <section> element 
> instead of <p> now passes. However, I cannot build Tika because 
> TestChmExtraction fails [3] and TestChmExtractor runs indefinitely, so I have 
> to terminate the build.
> 
> It seems that TagSoup treats the <article> and <section> elements as 
> unknown elements, but for some reason it does allow other HTML5 elements such 
> as <dfn> and likely others. What can I do? Is this an issue that should be 
> solved in TagSoup (and if so, how)? Should we make the ignoreBogonsFeature 
> configurable via ParseContext? Other clever ideas?
> 
> Thanks!
> 
> [1]: http://mercury.ccil.org/~cowan/XML/tagsoup/#properties
> [2]: https://issues.apache.org/jira/browse/TIKA-599
> [3]: Running org.apache.tika.parser.chm.TestChmExtraction
> java.lang.NullPointerException
>         at org.ccil.cowan.tagsoup.Element.<init>(Element.java:39)
>         at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
>         at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
>         at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>         at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
>         at org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
>         at org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
>         at org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>         at org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> 
>  
> -----Original message-----
> > From:Markus Jelsma <[email protected]>
> > Sent: Wed 29-Aug-2012 13:40
> > To: [email protected]
> > Subject: Article and section tags
> > 
> > Hi,
> > 
> > I'm still testing internet pages for TIKA-980 and to my surprise it cannot 
> > deal with <section> and <article> tags. Whenever I print the tag's name in 
> > startElement I never see those elements and therefore I cannot extract 
> > microdata. Where are those elements going? How can I get them? I use the 
> > IdentityHtmlMapper in the unit test.
> > 
> > Thanks,
> > Markus
> > 
> 
