[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350330#comment-16350330 ] Tim Allison commented on TIKA-2562: --- This is a "feature" of tagsoup; see, e.g.

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Andrei Rebegea (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350418#comment-16350418 ] Andrei Rebegea commented on TIKA-2490: -- Hello, I am using tika version 1.17 and still getting these

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350335#comment-16350335 ] Tim Allison commented on TIKA-1599: --- What say we do a fresh eval on our current corpus and then do a

[jira] [Commented] (TIKA-2561) Tika Parser includes outdated/vulnerable version of JSoup

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350302#comment-16350302 ] Tim Allison commented on TIKA-2561: --- This is helpful.  It boggles my imagination that this could be a

[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350335#comment-16350335 ] Tim Allison edited comment on TIKA-1599 at 2/2/18 1:42 PM: --- What say we do a

[jira] [Commented] (TIKA-2561) Tika Parser includes outdated/vulnerable version of JSoup

2018-02-02 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350406#comment-16350406 ] Hudson commented on TIKA-2561: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1429 (See

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350504#comment-16350504 ] Luis Filipe Nassif commented on TIKA-1599: -- Hi [~talli...@mitre.org], Moving to DOM could lead to

[jira] [Resolved] (TIKA-2561) Tika Parser includes outdated/vulnerable version of JSoup

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2561. --- Resolution: Fixed Fix Version/s: 1.18 2.0 > Tika Parser includes

[jira] [Updated] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2490: -- Fix Version/s: 1.17 > Turn off stderr warnings in Tika-app > > >

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350420#comment-16350420 ] Tim Allison commented on TIKA-2490: --- No, this isn't supposed to happen if you use the example

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Andrei Rebegea (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350528#comment-16350528 ] Andrei Rebegea commented on TIKA-2490: -- The short answer is that I don't know the full details, sorry.

[jira] [Updated] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1599: Attachment: consumentenbond.html > Switch from TagSoup to JSoup > > >

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350556#comment-16350556 ] Tim Allison commented on TIKA-1599: --- >Tim, if attached file is what you are looking for, i've got about

[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350606#comment-16350606 ] Tim Allison commented on TIKA-2563: --- Right.  Sorry.  I meant the {{testHTML_embedded_img.html}}, NOT the

[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634 ] NW Brad edited comment on TIKA-2562 at 2/2/18 4:51 PM: --- Thanks.  I checked it out and

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350545#comment-16350545 ] Markus Jelsma commented on TIKA-1599: - On topic, our parser on top of Tika relies on a custom

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Andrei Rebegea (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350543#comment-16350543 ] Andrei Rebegea commented on TIKA-2490: -- OK. Thanks for the answer. So "Is this still supposed to

[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634 ] NW Brad edited comment on TIKA-2562 at 2/2/18 4:50 PM: --- Thanks.  I checked it out and

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350538#comment-16350538 ] Tim Allison commented on TIKA-2490: --- Whoa, welcome to modernity. :) >I am assuming that just by

[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Attachment: consumentenbond.html > Extract embedded files in HTML > -- > >

[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634 ] NW Brad commented on TIKA-2562: --- Thanks. I checked it out, and tagsoup is definitely adding the shape. I

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350529#comment-16350529 ] Tim Allison commented on TIKA-1599: --- >DOM could lead to higher memory usage Y, that's my concern, esp

[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350529#comment-16350529 ] Tim Allison edited comment on TIKA-1599 at 2/2/18 3:39 PM: --- >DOM could lead to

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350550#comment-16350550 ] Tim Allison commented on TIKA-2490: --- Y, sorry.  We could change this behavior back to ignore missing

[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350604#comment-16350604 ] Markus Jelsma commented on TIKA-2563: - I am not sure if ASL 2.0 friendly would apply. I took it some

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350541#comment-16350541 ] Markus Jelsma commented on TIKA-1599: - Tim, if attached file is what you are looking for, i've got

[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350593#comment-16350593 ] Tim Allison commented on TIKA-2563: --- ASL 2.0 friendly example file based on example file kindly supplied

[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Attachment: testHTML_embedded_img.html > Extract embedded files in HTML > --

[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350614#comment-16350614 ] Markus Jelsma commented on TIKA-2563: - Ah, thanks :) > Extract embedded files in HTML >

[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Description: Files (esp images) and other objects can be embedded in html/css/javascript with the

[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Description: Files (esp images) and other objects can be embedded in html/css/javascript with the [data:

[jira] [Created] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2563: - Summary: Extract embedded files in HTML Key: TIKA-2563 URL: https://issues.apache.org/jira/browse/TIKA-2563 Project: Tika Issue Type: Improvement

[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350555#comment-16350555 ] Tim Allison commented on TIKA-2563: --- Attached example file supplied by [~markus17] on TIKA-1599.  Thank

[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350822#comment-16350822 ] Tim Allison commented on TIKA-2562: --- I'll take a look.  This will require some digging. > tika server

[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351224#comment-16351224 ] NW Brad commented on TIKA-2562: --- I was doing some research on this today and this may not be a function of

[jira] [Created] (TIKA-2565) Upgrade edu.ucar dependencies to 4.6.11

2018-02-02 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created TIKA-2565: -- Summary: Upgrade edu.ucar dependencies to 4.6.11 Key: TIKA-2565 URL: https://issues.apache.org/jira/browse/TIKA-2565 Project: Tika Issue Type:

[jira] [Created] (TIKA-2564) Tika client cannot extract files from embedded archive formats

2018-02-02 Thread Marc Prud'hommeaux (JIRA)
Marc Prud'hommeaux created TIKA-2564: Summary: Tika client cannot extract files from embedded archive formats Key: TIKA-2564 URL: https://issues.apache.org/jira/browse/TIKA-2564 Project: Tika