I checked TagSoup's properties [1] and tried disabling the ignoreBogonsFeature that was introduced with TIKA-599 [2]. My unit test using the <section> element instead of <p> now passes. However, I cannot build Tika because TestChmExtraction fails [3] and TestChmExtractor runs indefinitely, so I have to terminate the build.
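
For anyone following along, here is a minimal sketch of how such a feature can be toggled on a SAX XMLReader. The feature URI is the ignore-bogons one listed on the TagSoup page [1]; the class and method names (BogonsToggle, setIgnoreBogons) are made up for illustration, and this is not the actual change in Tika's HtmlParser.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper, for illustration only.
class BogonsToggle {
    // TagSoup's ignore-bogons feature URI, as listed on its documentation page.
    static final String IGNORE_BOGONS =
            "http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons";

    // Tries to set the feature on the given reader.
    // Returns false if the reader does not recognize or support it.
    static boolean setIgnoreBogons(XMLReader reader, boolean ignore) {
        try {
            reader.setFeature(IGNORE_BOGONS, ignore);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }
}
```

With TagSoup on the classpath, the reader would be a new org.ccil.cowan.tagsoup.Parser(), and passing false should let unknown elements through to the ContentHandler.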

It seems that TagSoup treats the <article> and <section> elements as unknown elements, but for some reason it does allow other HTML5 elements such as <dfn> and likely others. What can I do? Is this an issue that should be solved in TagSoup (how)? Should we make the ignoreBogonsFeature configurable via ParseContext? Other clever ideas?

Thanks!

[1]: http://mercury.ccil.org/~cowan/XML/tagsoup/#properties
[2]: https://issues.apache.org/jira/browse/TIKA-599
[3]:
Running org.apache.tika.parser.chm.TestChmExtraction
java.lang.NullPointerException
        at org.ccil.cowan.tagsoup.Element.<init>(Element.java:39)
        at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
        at org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
        at org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
java.lang.NullPointerException
        at org.ccil.cowan.tagsoup.Element.<init>(Element.java:39)
        at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
        at org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
        at org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wed 29-Aug-2012 13:40
> To: [email protected]
> Subject: Article and section tags
>
> Hi,
>
> I'm still testing internet pages for TIKA-980 and to my surprise it cannot
> deal with <section> and <element> tags. Whenever i print the tag's name in
> startElement i never see those elements and therefore i cannot extract
> microdata. Where are those elements going? How can i get them? I use the
> IdentityHtmlMapper in the unit test.
>
> Thanks,
> Markus
>
