[jira] [Commented] (TIKA-1808) Head section closed too eager

2023-09-26 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769239#comment-17769239 ] Markus Jelsma commented on TIKA-1808: - Ah, i read your message incorrectly. Well, if we come across

[jira] [Commented] (TIKA-1808) Head section closed too eager

2023-09-26 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769232#comment-17769232 ] Markus Jelsma commented on TIKA-1808: - Aah, i am happy to read that some stuff is fixed for free with

[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671452#comment-16671452 ] Markus Jelsma commented on TIKA-2760: - Hello [~davemeikle], Of course! I cannot understand why i did

[jira] [Closed] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed TIKA-2760. --- > LinkContentHandler does not report hyperlinks > - > >

[jira] [Resolved] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved TIKA-2760. - Resolution: Not A Problem > LinkContentHandler does not report hyperlinks >

[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668605#comment-16668605 ] Markus Jelsma commented on TIKA-2760: - Hello [~davemeikle], I cannot get any links using any HTML

[jira] [Commented] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler

2018-10-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659320#comment-16659320 ] Markus Jelsma commented on TIKA-2759: - Thanks [~talli...@apache.org]! > ScriptsExtractor incorrectly

[jira] [Commented] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658019#comment-16658019 ] Markus Jelsma commented on TIKA-2758: - [~kkrugler] if you or anyone suspect a change could be

[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655169#comment-16655169 ] Markus Jelsma commented on TIKA-2760: - Patch file only contains a unit test. The expected part of the

[jira] [Updated] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2760: Attachment: TIKA-2760.patch > LinkContentHandler does not report hyperlinks >

[jira] [Updated] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2760: Attachment: ronaldmcdonald-nolinks.html > LinkContentHandler does not report hyperlinks >

[jira] [Created] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-18 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-2760: --- Summary: LinkContentHandler does not report hyperlinks Key: TIKA-2760 URL: https://issues.apache.org/jira/browse/TIKA-2760 Project: Tika Issue Type: Bug

[jira] [Updated] (TIKA-2758) Possible error charset detection

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2758: Description: I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 995

[jira] [Updated] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2759: Description: We extract Javascript as text content while instead it is actually a script tag with

[jira] [Updated] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2759: Attachment: petrolicious.html > ScriptsExtractor incorrectly reports Javascript to characters() in

[jira] [Created] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler

2018-10-18 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-2759: --- Summary: ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler Key: TIKA-2759 URL: https://issues.apache.org/jira/browse/TIKA-2759

[jira] [Updated] (TIKA-2758) Possible error charset detection

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2758: Attachment: independent.html > Possible error charset detection >

[jira] [Updated] (TIKA-2758) Possible error charset detection

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2758: Attachment: detroidnews.html > Possible error charset detection >

[jira] [Created] (TIKA-2758) Possible error charset detection

2018-10-18 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-2758: --- Summary: Possible error charset detection Key: TIKA-2758 URL: https://issues.apache.org/jira/browse/TIKA-2758 Project: Tika Issue Type: Bug

[jira] [Commented] (TIKA-2576) Add application/zstd detection and parser

2018-02-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380705#comment-16380705 ] Markus Jelsma commented on TIKA-2576: - I don't know if it is documented but that config file will fix

[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350614#comment-16350614 ] Markus Jelsma commented on TIKA-2563: - Ah, thanks :) > Extract embedded files in HTML >

[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350604#comment-16350604 ] Markus Jelsma commented on TIKA-2563: - I am not sure if ASL 2.0 friendly would apply. I took it some

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350545#comment-16350545 ] Markus Jelsma commented on TIKA-1599: - On topic, our parser on top of Tika relies on a custom

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350541#comment-16350541 ] Markus Jelsma commented on TIKA-1599: - Tim, if attached file is what you are looking for, i've got

[jira] [Updated] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1599: Attachment: consumentenbond.html > Switch from TagSoup to JSoup > > >

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253524#comment-16253524 ] Markus Jelsma commented on TIKA-2490: - Good enough! Thanks! > Turn off stderr warnings in Tika-app >

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253383#comment-16253383 ] Markus Jelsma commented on TIKA-2490: - Ok, so what should we do in Nutch. By default, no

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251338#comment-16251338 ] Markus Jelsma commented on TIKA-2490: - I attached a Nutch patch for upgrading to 1.16, modified to work

[jira] [Updated] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2490: Attachment: NUTCH-2439-1.17.patch > Turn off stderr warnings in Tika-app >

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251337#comment-16251337 ] Markus Jelsma commented on TIKA-2490: - I still get: {code} Nov 14, 2017 1:33:11 PM

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240369#comment-16240369 ] Markus Jelsma commented on TIKA-2490: - If you have a patch, of course, feel free to open a ticket! >

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240349#comment-16240349 ] Markus Jelsma commented on TIKA-2490: - Yes! > Turn off stderr warnings in Tika-app >

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240342#comment-16240342 ] Markus Jelsma commented on TIKA-2490: - No, old Nutch style: {code} tikaConfig = new

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2017-11-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240224#comment-16240224 ] Markus Jelsma commented on TIKA-2490: - Hello [~talli...@mitre.org], that works. But we still see:

[jira] [Updated] (TIKA-2491) Cannot use TikaConfig

2017-11-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-2491: Attachment: tika-config.xml > Cannot use TikaConfig > - > > Key:

[jira] [Created] (TIKA-2491) Cannot use TikaConfig

2017-11-03 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-2491: --- Summary: Cannot use TikaConfig Key: TIKA-2491 URL: https://issues.apache.org/jira/browse/TIKA-2491 Project: Tika Issue Type: Bug Affects Versions: 1.16

[jira] [Created] (TIKA-2485) HTMLEncodingDetector content limit to be configurable

2017-10-27 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-2485: --- Summary: HTMLEncodingDetector content limit to be configurable Key: TIKA-2485 URL: https://issues.apache.org/jira/browse/TIKA-2485 Project: Tika Issue Type:

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2016-03-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177572#comment-15177572 ] Markus Jelsma commented on TIKA-1782: - Yes i, unfortunately, agree. The unit test i supplied, similar

[jira] [Created] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-19 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-1835: --- Summary: LinkContentHandler skips iframe and rel tags Key: TIKA-1835 URL: https://issues.apache.org/jira/browse/TIKA-1835 Project: Tika Issue Type: Bug

[jira] [Updated] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1835: Flags: Patch,Important (was: Important) > LinkContentHandler skips iframe and rel tags >

[jira] [Updated] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1835: Attachment: TIKA-1835.patch Patch for trunk. Adds support for iframe and link element link

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048545#comment-15048545 ] Markus Jelsma commented on TIKA-1599: - Hi - i also don't know how hard it would be to support JSoup.

[jira] [Commented] (TIKA-1808) Head section closed too eager

2015-12-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048544#comment-15048544 ] Markus Jelsma commented on TIKA-1808: - Hello Ken - that makes sense indeed, if it is not valid, close

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048784#comment-15048784 ] Markus Jelsma commented on TIKA-1599: - Hello Ken - i would like to believe that ParseContext is ideal

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049308#comment-15049308 ] Markus Jelsma commented on TIKA-1599: - Hello - we rely on Tika for our content extraction framework,

[jira] [Commented] (TIKA-985) Support for HTML5 elements

2015-12-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049317#comment-15049317 ] Markus Jelsma commented on TIKA-985: Hello Tim - there is a unit test in TIKA-980. It relies on this

[jira] [Updated] (TIKA-1808) Head section closed too eager

2015-12-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1808: Description: XHTMLContentHandler has some logic that closes the head section too early, or this is

[jira] [Created] (TIKA-1808) Head section closed too eager

2015-12-08 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-1808: --- Summary: Head section closed too eager Key: TIKA-1808 URL: https://issues.apache.org/jira/browse/TIKA-1808 Project: Tika Issue Type: Bug Components:

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-11-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002195#comment-15002195 ] Markus Jelsma commented on TIKA-1782: - Hi - i have no test hanging around but my consumer code

[jira] [Commented] (TIKA-980) MicrodataContentHandler for Apache Tika

2015-11-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002216#comment-15002216 ] Markus Jelsma commented on TIKA-980: Hello Nick - the identity mapper is required because without it,

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-11-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002218#comment-15002218 ] Markus Jelsma commented on TIKA-1782: - Hello Tim, i think there is a test, see TIKA-980. The unit test

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976474#comment-14976474 ] Markus Jelsma commented on TIKA-1782: - Ah, testJPEG() fails independently and has nothing to do with

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976401#comment-14976401 ] Markus Jelsma commented on TIKA-1782: - Hello Tim, is testJPEG's failure unrelated to this change? >

[jira] [Updated] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1782: Attachment: TIKA-1782.patch Patch for trunk, ImageParserTest fails,

[jira] [Created] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-26 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-1782: --- Summary: XHTMLContentHandler doesn't pass attributes of html element Key: TIKA-1782 URL: https://issues.apache.org/jira/browse/TIKA-1782 Project: Tika Issue

[jira] [Commented] (TIKA-1782) XHTMLContentHandler doesn't pass attributes of html element

2015-10-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974439#comment-14974439 ] Markus Jelsma commented on TIKA-1782: - Hello - this is on 1.8.0_40 and on Ubuntu 14.10 openjdk version

[jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-12-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851648#comment-13851648 ] Markus Jelsma commented on TIKA-1193: - Hi - does this new patch need some adjustments?

[jira] [Updated] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-11-29 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1193: Attachment: TIKA-1193-trunk.patch Yes, i agree. Here's a new patch plus unit test using a

[jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-11-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825823#comment-13825823 ] Markus Jelsma commented on TIKA-1193: - Hi - are there any objections to putting this in?

[jira] [Created] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-11-11 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-1193: --- Summary: Allow access to HtmlParser's HtmlSchema Key: TIKA-1193 URL: https://issues.apache.org/jira/browse/TIKA-1193 Project: Tika Issue Type: Improvement

[jira] [Updated] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

2013-11-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1193: Attachment: TIKA-1193-trunk.patch Patch for trunk. Allow access to HtmlParser's HtmlSchema

[jira] [Commented] (TIKA-676) Boilerpipe fails

2013-10-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789742#comment-13789742 ] Markus Jelsma commented on TIKA-676: That would be cool, but it would be great if he

[jira] [Commented] (TIKA-676) Boilerpipe fails

2013-10-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789746#comment-13789746 ] Markus Jelsma commented on TIKA-676: Oh, i checked. None of my open issues are directly

[jira] [Commented] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2013-08-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749944#comment-13749944 ] Markus Jelsma commented on TIKA-961: Any chance this one is going to be committed?

[jira] [Updated] (TIKA-985) Support for HTML5 elements

2013-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-985: --- Attachment: TIKA-985-1.5.patch Dirty patch for Tika 1.5. This patch allows for headings (h1...h6) to

[jira] [Resolved] (TIKA-992) OpenGraph meta tags to allow multiple values

2013-05-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved TIKA-992. Resolution: Fixed Thanks Dave. Marked as resolved. OpenGraph meta tags to allow

[jira] [Commented] (TIKA-992) OpenGraph meta tags to allow multiple values

2013-05-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656037#comment-13656037 ] Markus Jelsma commented on TIKA-992: Hi Kiran - this patch works for any meta tag that

[jira] [Commented] (TIKA-992) OpenGraph meta tags to allow multiple values

2013-05-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656043#comment-13656043 ] Markus Jelsma commented on TIKA-992: BTW, any reason why this is still not committed?

[jira] [Created] (TIKA-1009) Expose TextDocument in BoilerpipeContentHandler

2012-10-17 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-1009: --- Summary: Expose TextDocument in BoilerpipeContentHandler Key: TIKA-1009 URL: https://issues.apache.org/jira/browse/TIKA-1009 Project: Tika Issue Type:

[jira] [Updated] (TIKA-1009) Expose TextDocument in BoilerpipeContentHandler

2012-10-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1009: Attachment: TIKA-1009-1.3-1.patch Patch adding the getTextDocument() method to the

[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2012-10-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-980: --- Attachment: TIKA-980-1.3-4.patch Here's a new patch. It allows to find nested structures and still

[jira] [Updated] (TIKA-985) Support for HTML5 elements

2012-10-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-985: --- Attachment: TIKA-985-1.3-3.patch Here's a new patch. It allows metadata to be read from within the

[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element

2012-09-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-995: --- Attachment: TIKA-995-1.3-1.patch Here's a quick fix. If the body is removed from the AUTO Set all

[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element

2012-09-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-995: --- Attachment: TIKA-995-unit.patch Here's a unit test. XHTMLContentHandler doesn't

[jira] [Created] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element

2012-09-21 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-995: -- Summary: XHTMLContentHandler doesn't pass attributes of body element Key: TIKA-995 URL: https://issues.apache.org/jira/browse/TIKA-995 Project: Tika Issue

[jira] [Created] (TIKA-992) OpenGraph meta tags to allow multiple values

2012-09-04 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-992: -- Summary: OpenGraph meta tags to allow multiple values Key: TIKA-992 URL: https://issues.apache.org/jira/browse/TIKA-992 Project: Tika Issue Type: Bug

[jira] [Updated] (TIKA-992) OpenGraph meta tags to allow multiple values

2012-09-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-992: --- Attachment: TIKA-992-1.3-1.patch Here's a patch improving the unit test and relies on Metadata.add()

[jira] [Created] (TIKA-985) Support for HTML5 elements

2012-08-30 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-985: -- Summary: Support for HTML5 elements Key: TIKA-985 URL: https://issues.apache.org/jira/browse/TIKA-985 Project: Tika Issue Type: Improvement

[jira] [Updated] (TIKA-985) Support for HTML5 elements

2012-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-985: --- Attachment: TIKA-985-1.3-1.patch Here's a preliminary patch for 1.3. It adds some HTML5 elements to

[jira] [Updated] (TIKA-985) Support for HTML5 elements

2012-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-985: --- Attachment: TIKA-985-1.3-2.patch Here's a new patch listing all HTML5 elements that are missing in the

[jira] [Commented] (TIKA-980) MicrodataContentHandler for Apache Tika

2012-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445415#comment-13445415 ] Markus Jelsma commented on TIKA-980: No, the Any23 parser is DOM-based and the

[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2012-08-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-980: --- Attachment: TIKA-980-1.3-3.patch Here's a new patch trimming and removing excess whitespace from

[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2012-08-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-980: --- Attachment: TIKA-980-1.3-2.patch - improved itemprop attribute handling - moved package to

[jira] [Updated] (TIKA-975) LinkBuilder to optionally collapse anchor whitespace

2012-08-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-975: --- Attachment: TIKA-975-1.3-2.patch Here's a new patch with a unit test. LinkBuilder to

[jira] [Updated] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2012-08-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-961: --- Attachment: TIKA-961-1.3-2.patch Here's a new patch with unit test. The test breaks when checking for

[jira] [Updated] (TIKA-975) LinkBuilder to optionally collapse anchor whitespace

2012-08-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-975: --- Attachment: TIKA-975-1.3-1.patch Here's a patch for trunk. LinkBuilder to optionally

[jira] [Commented] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2012-08-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13435139#comment-13435139 ] Markus Jelsma commented on TIKA-961: Browsing through the code i believe we can

[jira] [Commented] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2012-08-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1343#comment-1343 ] Markus Jelsma commented on TIKA-961: Ken, I'll see if i can provide a test but i'd

[jira] [Created] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2012-07-30 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-961: -- Summary: No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true) Key: TIKA-961 URL: https://issues.apache.org/jira/browse/TIKA-961 Project: Tika

[jira] [Updated] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2012-07-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-961: --- Attachment: TIKA-961-1.3-1.patch Patch for 1.3 adding ignorableWhitespace if the last character is no

[jira] [Comment Edited] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2012-07-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424866#comment-13424866 ] Markus Jelsma edited comment on TIKA-961 at 7/30/12 2:03 PM: -

[jira] [Commented] (TIKA-676) Boilerpipe fails

2011-08-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089408#comment-13089408 ] Markus Jelsma commented on TIKA-676: Makes sense, thanks! Boilerpipe fails

[jira] [Commented] (TIKA-648) Parsing HTML anchors with embedded div faulty

2011-08-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089410#comment-13089410 ] Markus Jelsma commented on TIKA-648: Thanks. I assume this is not something that needs

[jira] [Commented] (TIKA-676) Boilerpipe fails

2011-08-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086370#comment-13086370 ] Markus Jelsma commented on TIKA-676: Is this going to be integrated with Tika 1.0? Is

[jira] [Updated] (TIKA-648) Parsing HTML anchors with embedded div faulty

2011-08-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-648: --- Fix Version/s: 1.0 Parsing HTML anchors with embedded div faulty

[jira] [Commented] (TIKA-676) Boilerpipe fails

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13065220#comment-13065220 ] Markus Jelsma commented on TIKA-676: Good work! Upgrading to BoilerPipe 1.2.0 fixes the

[jira] [Created] (TIKA-648) Parsing HTML anchors with embedded div faulty

2011-04-26 Thread Markus Jelsma (JIRA)
Parsing HTML anchors with embedded div faulty - Key: TIKA-648 URL: https://issues.apache.org/jira/browse/TIKA-648 Project: Tika Issue Type: Bug Components: parser Affects Versions: