I checked TagSoup's properties [1] and tried disabling the ignoreBogonsFeature 
that was introduced with TIKA-599 [2]. My unit test using the <section> element 
instead of <p> now passes. However, I cannot build Tika because 
TestChmExtraction fails [3] and TestChmExtractor runs indefinitely, so I have to 
terminate the build.

It seems that TagSoup treats the <article> and <section> elements as unknown 
elements, but for some reason it does allow other HTML5 elements such as <dfn> 
and likely others. What can I do? Is this an issue that should be solved in 
TagSoup (and if so, how)? Should we make the ignoreBogonsFeature configurable 
via ParseContext? Any other clever ideas?
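
For anyone who wants to reproduce the check outside Tika, here is a minimal sketch of the kind of handler I mean: it just records every element name that reaches startElement. The class name ElementNameRecorder and the sample markup are made up for illustration, and the sketch runs against the JDK's own SAX parser on well-formed input, so it only shows what an unfiltered event stream looks like. Repeating it against TagSoup is described in a comment because it assumes the tagsoup jar is on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the name of every element reported by the parser.
class ElementNameRecorder extends DefaultHandler {
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.add(qName);
    }

    // Parses the markup with the given reader and returns the element names in document order.
    // Assumption: to repeat the check against TagSoup, pass an org.ccil.cowan.tagsoup.Parser
    // here after calling setFeature(Parser.ignoreBogonsFeature, false) on it.
    static List<String> elementNames(XMLReader reader, String markup) throws Exception {
        ElementNameRecorder recorder = new ElementNameRecorder();
        reader.setContentHandler(recorder);
        reader.parse(new InputSource(new StringReader(markup)));
        return recorder.names;
    }

    // Convenience overload using the JDK's default (non-namespace-aware) SAX parser.
    static List<String> elementNames(String markup) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        return elementNames(reader, markup);
    }

    public static void main(String[] args) throws Exception {
        String markup =
            "<html><body><article><section><p>text</p></section></article></body></html>";
        System.out.println(elementNames(markup));
    }
}
```

If <article> and <section> show up here but not when the same handler sits behind HtmlParser, the elements are being dropped before they reach the handler rather than by the handler itself.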

Thanks!

[1]: http://mercury.ccil.org/~cowan/XML/tagsoup/#properties
[2]: https://issues.apache.org/jira/browse/TIKA-599
[3]: Running org.apache.tika.parser.chm.TestChmExtraction
java.lang.NullPointerException
        at org.ccil.cowan.tagsoup.Element.<init>(Element.java:39)
        at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
        at org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
        at org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
java.lang.NullPointerException
        at org.ccil.cowan.tagsoup.Element.<init>(Element.java:39)
        at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
        at org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
        at org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
        at org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wed 29-Aug-2012 13:40
> To: [email protected]
> Subject: Article and section tags
> 
> Hi,
> 
> I'm still testing internet pages for TIKA-980 and to my surprise it cannot 
> deal with <section> and <article> tags. Whenever I print the tag's name in 
> startElement I never see those elements and therefore I cannot extract 
> microdata. Where are those elements going? How can I get them? I use the 
> IdentityHtmlMapper in the unit test.
> 
> Thanks,
> Markus
> 
