[
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769239#comment-17769239
]
Markus Jelsma commented on TIKA-1808:
-
Ah, i read your message incorrectly. Well, if we come across
[
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769232#comment-17769232
]
Markus Jelsma commented on TIKA-1808:
-
Aah, i am happy to read that some stuff is fixed for free with
[
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671452#comment-16671452
]
Markus Jelsma commented on TIKA-2760:
-
Hello [~davemeikle],
Of course! I cannot understand why i did
[
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma closed TIKA-2760.
---
> LinkContentHandler does not report hyperlinks
> -
>
>
[
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved TIKA-2760.
-
Resolution: Not A Problem
> LinkContentHandler does not report hyperlinks
>
[
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668605#comment-16668605
]
Markus Jelsma commented on TIKA-2760:
-
Hello [~davemeikle],
I cannot get any links using any HTML
[
https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659320#comment-16659320
]
Markus Jelsma commented on TIKA-2759:
-
Thanks [~talli...@apache.org]!
> ScriptsExtractor incorrectly
[
https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658019#comment-16658019
]
Markus Jelsma commented on TIKA-2758:
-
[~kkrugler] if you or anyone suspect a change could be
[
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655169#comment-16655169
]
Markus Jelsma commented on TIKA-2760:
-
Patch file only contains a unit test. The expected part of the
[
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2760:
Attachment: TIKA-2760.patch
> LinkContentHandler does not report hyperlinks
>
[
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2760:
Attachment: ronaldmcdonald-nolinks.html
> LinkContentHandler does not report hyperlinks
>
Markus Jelsma created TIKA-2760:
---
Summary: LinkContentHandler does not report hyperlinks
Key: TIKA-2760
URL: https://issues.apache.org/jira/browse/TIKA-2760
Project: Tika
Issue Type: Bug
[
https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2758:
Description:
I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all
995
[
https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2759:
Description:
We extract Javascript as text content while instead it is actually a script tag
with
[
https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2759:
Attachment: petrolicious.html
> ScriptsExtractor incorrectly reports Javascript to characters() in
Markus Jelsma created TIKA-2759:
---
Summary: ScriptsExtractor incorrectly reports Javascript to
characters() in SAX ContentHandler
Key: TIKA-2759
URL: https://issues.apache.org/jira/browse/TIKA-2759
[
https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2758:
Attachment: independent.html
> Possible error charset detection
>
[
https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2758:
Attachment: detroidnews.html
> Possible error charset detection
>
Markus Jelsma created TIKA-2758:
---
Summary: Possible error charset detection
Key: TIKA-2758
URL: https://issues.apache.org/jira/browse/TIKA-2758
Project: Tika
Issue Type: Bug
[
https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380705#comment-16380705
]
Markus Jelsma commented on TIKA-2576:
-
I don't know if it is documented but that config file will fix
[
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350614#comment-16350614
]
Markus Jelsma commented on TIKA-2563:
-
Ah, thanks :)
> Extract embedded files in HTML
>
[
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350604#comment-16350604
]
Markus Jelsma commented on TIKA-2563:
-
I am not sure if ASL 2.0 friendly would apply. I took it some
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350545#comment-16350545
]
Markus Jelsma commented on TIKA-1599:
-
On topic, our parser on top of Tika relies on a custom
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350541#comment-16350541
]
Markus Jelsma commented on TIKA-1599:
-
Tim, if attached file is what you are looking for, i've got
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1599:
Attachment: consumentenbond.html
> Switch from TagSoup to JSoup
>
>
>
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253524#comment-16253524
]
Markus Jelsma commented on TIKA-2490:
-
Good enough! Thanks!
> Turn off stderr warnings in Tika-app
>
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253383#comment-16253383
]
Markus Jelsma commented on TIKA-2490:
-
Ok, so what should we do in Nutch. By default, no
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251338#comment-16251338
]
Markus Jelsma commented on TIKA-2490:
-
I attached a Nutch patch for upgrading to 1.16, modified to work
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2490:
Attachment: NUTCH-2439-1.17.patch
> Turn off stderr warnings in Tika-app
>
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251337#comment-16251337
]
Markus Jelsma commented on TIKA-2490:
-
I still get:
{code}
Nov 14, 2017 1:33:11 PM
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240369#comment-16240369
]
Markus Jelsma commented on TIKA-2490:
-
If you have a patch, of course, feel free to open a ticket!
>
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240349#comment-16240349
]
Markus Jelsma commented on TIKA-2490:
-
Yes!
> Turn off stderr warnings in Tika-app
>
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240342#comment-16240342
]
Markus Jelsma commented on TIKA-2490:
-
No, old Nutch style:
{code}
tikaConfig = new
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240224#comment-16240224
]
Markus Jelsma commented on TIKA-2490:
-
Hello [~talli...@mitre.org], that works. But we still see:
[
https://issues.apache.org/jira/browse/TIKA-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-2491:
Attachment: tika-config.xml
> Cannot use TikaConfig
> -
>
> Key:
Markus Jelsma created TIKA-2491:
---
Summary: Cannot use TikaConfig
Key: TIKA-2491
URL: https://issues.apache.org/jira/browse/TIKA-2491
Project: Tika
Issue Type: Bug
Affects Versions: 1.16
Markus Jelsma created TIKA-2485:
---
Summary: HTMLEncodingDetector content limit to be configurable
Key: TIKA-2485
URL: https://issues.apache.org/jira/browse/TIKA-2485
Project: Tika
Issue Type:
[
https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177572#comment-15177572
]
Markus Jelsma commented on TIKA-1782:
-
Yes i, unfortunately, agree. The unit test i supplied, similar
Markus Jelsma created TIKA-1835:
---
Summary: LinkContentHandler skips iframe and rel tags
Key: TIKA-1835
URL: https://issues.apache.org/jira/browse/TIKA-1835
Project: Tika
Issue Type: Bug
[
https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1835:
Flags: Patch,Important (was: Important)
> LinkContentHandler skips iframe and rel tags
>
[
https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1835:
Attachment: TIKA-1835.patch
Patch for trunk. Adds support for iframe and link element link
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048545#comment-15048545
]
Markus Jelsma commented on TIKA-1599:
-
Hi - i also don't know how hard it would be to support JSoup.
[
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048544#comment-15048544
]
Markus Jelsma commented on TIKA-1808:
-
Hello Ken - that makes sense indeed, if it is not valid, close
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048784#comment-15048784
]
Markus Jelsma commented on TIKA-1599:
-
Hello Ken - i would like to believe that ParseContext is ideal
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049308#comment-15049308
]
Markus Jelsma commented on TIKA-1599:
-
Hello - we rely on Tika for our content extraction framework,
[
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049317#comment-15049317
]
Markus Jelsma commented on TIKA-985:
Hello Tim - there is a unit test in TIKA-980. It relies on this
[
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1808:
Description:
XHTMLContentHandler has some logic that closes the head section too early, or
this is
Markus Jelsma created TIKA-1808:
---
Summary: Head section closed too eager
Key: TIKA-1808
URL: https://issues.apache.org/jira/browse/TIKA-1808
Project: Tika
Issue Type: Bug
Components:
[
https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002195#comment-15002195
]
Markus Jelsma commented on TIKA-1782:
-
Hi - i have no test hanging around but my consumer code
[
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002216#comment-15002216
]
Markus Jelsma commented on TIKA-980:
Hello Nick - the identity mapper is required because without it,
[
https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002218#comment-15002218
]
Markus Jelsma commented on TIKA-1782:
-
Hello Tim, i think there is a test, see TIKA-980. The unit test
[
https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976474#comment-14976474
]
Markus Jelsma commented on TIKA-1782:
-
Ah, testJPEG() fails independently and has nothing to do with
[
https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976401#comment-14976401
]
Markus Jelsma commented on TIKA-1782:
-
Hello Tim, is testJPEG's failure unrelated to this change?
>
[
https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1782:
Attachment: TIKA-1782.patch
Patch for trunk, ImageParserTest fails,
Markus Jelsma created TIKA-1782:
---
Summary: XHTMLContentHandler doesn't pass attributes of html
element
Key: TIKA-1782
URL: https://issues.apache.org/jira/browse/TIKA-1782
Project: Tika
Issue
[
https://issues.apache.org/jira/browse/TIKA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974439#comment-14974439
]
Markus Jelsma commented on TIKA-1782:
-
Hello - this is on 1.8.0_40 and on Ubuntu 14.10
openjdk version
[
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13851648#comment-13851648
]
Markus Jelsma commented on TIKA-1193:
-
Hi - does this new patch need some adjustments?
[
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1193:
Attachment: TIKA-1193-trunk.patch
Yes, i agree. Here's a new patch plus unit test using a
[
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825823#comment-13825823
]
Markus Jelsma commented on TIKA-1193:
-
Hi - are there any objections to putting this in?
Markus Jelsma created TIKA-1193:
---
Summary: Allow access to HtmlParser's HtmlSchema
Key: TIKA-1193
URL: https://issues.apache.org/jira/browse/TIKA-1193
Project: Tika
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1193:
Attachment: TIKA-1193-trunk.patch
Patch for trunk.
Allow access to HtmlParser's HtmlSchema
[
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789742#comment-13789742
]
Markus Jelsma commented on TIKA-676:
That would be cool, but it would be great if he
[
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789746#comment-13789746
]
Markus Jelsma commented on TIKA-676:
Oh, i checked. None of my open issues are directly
[
https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749944#comment-13749944
]
Markus Jelsma commented on TIKA-961:
Any chance this one is going to be committed?
[
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-985:
---
Attachment: TIKA-985-1.5.patch
Dirty patch for Tika 1.5. This patch allows for headings (h1...h6) to
[
https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved TIKA-992.
Resolution: Fixed
Thanks Dave. Marked as resolved.
OpenGraph meta tags to allow
[
https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656037#comment-13656037
]
Markus Jelsma commented on TIKA-992:
Hi Kiran - this patch works for any meta tag that
[
https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656043#comment-13656043
]
Markus Jelsma commented on TIKA-992:
BTW, any reason why this is still not committed?
Markus Jelsma created TIKA-1009:
---
Summary: Expose TextDocument in BoilerpipeContentHandler
Key: TIKA-1009
URL: https://issues.apache.org/jira/browse/TIKA-1009
Project: Tika
Issue Type:
[
https://issues.apache.org/jira/browse/TIKA-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-1009:
Attachment: TIKA-1009-1.3-1.patch
Patch adding the getTextDocument() method to the
[
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-980:
---
Attachment: TIKA-980-1.3-4.patch
Here's a new patch. It allows finding nested structures and still
[
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-985:
---
Attachment: TIKA-985-1.3-3.patch
Here's a new patch. It allows metadata to be read from within the
[
https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-995:
---
Attachment: TIKA-995-1.3-1.patch
Here's a quick fix. If the body is removed from the AUTO Set all
[
https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-995:
---
Attachment: TIKA-995-unit.patch
Here's a unit test.
XHTMLContentHandler doesn't
Markus Jelsma created TIKA-995:
--
Summary: XHTMLContentHandler doesn't pass attributes of body
element
Key: TIKA-995
URL: https://issues.apache.org/jira/browse/TIKA-995
Project: Tika
Issue
Markus Jelsma created TIKA-992:
--
Summary: OpenGraph meta tags to allow multiple values
Key: TIKA-992
URL: https://issues.apache.org/jira/browse/TIKA-992
Project: Tika
Issue Type: Bug
[
https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-992:
---
Attachment: TIKA-992-1.3-1.patch
Here's a patch that improves the unit test and relies on Metadata.add()
Markus Jelsma created TIKA-985:
--
Summary: Support for HTML5 elements
Key: TIKA-985
URL: https://issues.apache.org/jira/browse/TIKA-985
Project: Tika
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-985:
---
Attachment: TIKA-985-1.3-1.patch
Here's a preliminary patch for 1.3. It adds some HTML5 elements to
[
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-985:
---
Attachment: TIKA-985-1.3-2.patch
Here's a new patch listing all HTML5 elements that are missing in the
[
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445415#comment-13445415
]
Markus Jelsma commented on TIKA-980:
No, the Any23 parser is DOM-based and the
[
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-980:
---
Attachment: TIKA-980-1.3-3.patch
Here's a new patch trimming and removing excess whitespace from
[
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-980:
---
Attachment: TIKA-980-1.3-2.patch
- improved itemprop attribute handling
- moved package to
[
https://issues.apache.org/jira/browse/TIKA-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-975:
---
Attachment: TIKA-975-1.3-2.patch
Here's a new patch with a unit test.
LinkBuilder to
[
https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-961:
---
Attachment: TIKA-961-1.3-2.patch
Here's a new patch with unit test. The test breaks when checking for
[
https://issues.apache.org/jira/browse/TIKA-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-975:
---
Attachment: TIKA-975-1.3-1.patch
Here's a patch for trunk.
LinkBuilder to optionally
[
https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13435139#comment-13435139
]
Markus Jelsma commented on TIKA-961:
Browsing through the code i believe we can
[
https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1343#comment-1343
]
Markus Jelsma commented on TIKA-961:
Ken,
I'll see if i can provide a test but i'd
Markus Jelsma created TIKA-961:
--
Summary: No whitespace added if
BoilerpipeContentHandler.setIncludeMarkup(true)
Key: TIKA-961
URL: https://issues.apache.org/jira/browse/TIKA-961
Project: Tika
[
https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-961:
---
Attachment: TIKA-961-1.3-1.patch
Patch for 1.3 adding ignorableWhitespace if the last character is no
[
https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424866#comment-13424866
]
Markus Jelsma edited comment on TIKA-961 at 7/30/12 2:03 PM:
-
[
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089408#comment-13089408
]
Markus Jelsma commented on TIKA-676:
Makes sense, thanks!
Boilerpipe fails
[
https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089410#comment-13089410
]
Markus Jelsma commented on TIKA-648:
Thanks. I assume this is not something that needs
[
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086370#comment-13086370
]
Markus Jelsma commented on TIKA-676:
Is this going to be integrated with Tika 1.0? Is
[
https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated TIKA-648:
---
Fix Version/s: 1.0
Parsing HTML anchors with embedded div faulty
[
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13065220#comment-13065220
]
Markus Jelsma commented on TIKA-676:
Good work! Upgrading to BoilerPipe 1.2.0 fixes the
Parsing HTML anchors with embedded div faulty
-
Key: TIKA-648
URL: https://issues.apache.org/jira/browse/TIKA-648
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: