Re: Tika 1.14?

2016-08-12 Thread lewis john mcgibbney
Good thread, Tim. Regarding open issues and low-hanging fruit to make it into 1.14, I will also work on finishing https://github.com/apache/tika/pull/112. I think Bob has an excellent point. The 2.X work is major and would be a big step in the right direction. Having both branches longer and longer

[jira] [Comment Edited] (TIKA-2013) Upgrade to POI 3.15-beta3 when available

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419274#comment-15419274 ] Tim Allison edited comment on TIKA-2013 at 8/12/16 6:59 PM: I compared Tika

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418855#comment-15418855 ] Tim Allison edited comment on TIKA-2038 at 8/12/16 6:40 PM: bq. But since I

[jira] [Comment Edited] (TIKA-2013) Upgrade to POI 3.15-beta3 when available

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419274#comment-15419274 ] Tim Allison edited comment on TIKA-2013 at 8/12/16 6:36 PM: I compared Tika

[jira] [Updated] (TIKA-2013) Upgrade to POI 3.15-beta3 when available

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2013: -- Attachment: potential_regressions_poi_3_15-beta3.zip I compared Tika with poi-3.15-beta1 vs the

[jira] [Updated] (TIKA-2013) Upgrade to POI 3.15-beta3 when available

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2013: -- Summary: Upgrade to POI 3.15-beta3 when available (was: Upgrade to POI 3.15-beta2 when available) >

[jira] [Commented] (TIKA-1938) HtmlParser drops

2016-08-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419202#comment-15419202 ] Hudson commented on TIKA-1938: -- SUCCESS: Integrated in tika-2.x #130 (See

[jira] [Commented] (TIKA-1980) HTML head tags found after first script not parsed by HtmlParser (regression)

2016-08-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419203#comment-15419203 ] Hudson commented on TIKA-1980: -- SUCCESS: Integrated in tika-2.x #130 (See

[jira] [Commented] (TIKA-1980) HTML head tags found after first script not parsed by HtmlParser (regression)

2016-08-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419150#comment-15419150 ] Hudson commented on TIKA-1980: -- FAILURE: Integrated in tika-2.x-windows #34 (See

[jira] [Commented] (TIKA-1938) HtmlParser drops

2016-08-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419149#comment-15419149 ] Hudson commented on TIKA-1938: -- FAILURE: Integrated in tika-2.x-windows #34 (See

tika-2.x-windows - Build # 34 - Still Failing

2016-08-12 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #34) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/34/ to view the results.

[jira] [Commented] (TIKA-1980) HTML head tags found after first script not parsed by HtmlParser (regression)

2016-08-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419104#comment-15419104 ] Hudson commented on TIKA-1980: -- SUCCESS: Integrated in Tika-trunk #1091 (See

[jira] [Commented] (TIKA-1938) HtmlParser drops

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419101#comment-15419101 ] Tim Allison commented on TIKA-1938: --- I just applied this to 2.x. > HtmlParser drops elements found

[jira] [Updated] (TIKA-1938) HtmlParser drops

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1938: -- Fix Version/s: 2.0 > HtmlParser drops elements found inside >

[jira] [Resolved] (TIKA-1980) HTML head tags found after first script not parsed by HtmlParser (regression)

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1980. --- Resolution: Fixed Fix Version/s: 1.14 2.0 Thank you, [~naegelejd]! > HTML

[GitHub] tika pull request #121: fix for TIKA-1980 contributed by naegelejd

2016-08-12 Thread asfgit
GitHub user asfgit closed the pull request at: https://github.com/apache/tika/pull/121

[jira] [Commented] (TIKA-1980) HTML head tags found after first script not parsed by HtmlParser (regression)

2016-08-12 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419059#comment-15419059 ] ASF GitHub Bot commented on TIKA-1980: -- Github user asfgit closed the pull request at:

[jira] [Assigned] (TIKA-1980) HTML head tags found after first script not parsed by HtmlParser (regression)

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1980: - Assignee: Tim Allison > HTML head tags found after first script not parsed by HtmlParser

[jira] [Commented] (TIKA-1980) HTML head tags found after first script not parsed by HtmlParser (regression)

2016-08-12 Thread Joseph Naegele (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418959#comment-15418959 ] Joseph Naegele commented on TIKA-1980: -- This should absolutely make it into 1.14. > HTML head tags

Re: Tika 1.14?

2016-08-12 Thread Chris Mattmann
1508 and 1680 are pending me/my review. I’ll get it done today. On 8/12/16, 4:24 AM, "Allison, Timothy B." wrote: >> I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes were under consideration. Has any

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418855#comment-15418855 ] Tim Allison edited comment on TIKA-2038 at 8/12/16 1:51 PM: bq. But since I

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418855#comment-15418855 ] Tim Allison commented on TIKA-2038: --- bq. But since I haven’t access to a broadband Internet connection

[jira] [Commented] (TIKA-2054) Problem with ligatures converting from PDF to HTML with Tika

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418760#comment-15418760 ] Tim Allison commented on TIKA-2054: --- You might try subclassing the XHTMLHandler/SafeContentHandler and

[jira] [Commented] (TIKA-2054) Problem with ligatures converting from PDF to HTML with Tika

2016-08-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418755#comment-15418755 ] Tim Allison commented on TIKA-2054: --- I don't think we want to modify our SafeContentHandler to stop

RE: Tika 1.14?

2016-08-12 Thread Luís Filipe Nassif
I think waiting for PDFBox 2.0.3 would be great. Some regressions are fixed there. Regards, Luis On Aug 12, 2016 08:24, "Allison, Timothy B." wrote: >> I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes

[jira] [Updated] (TIKA-2054) Problem with ligatures converting from PDF to HTML with Tika

2016-08-12 Thread Angela Onslow (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Angela Onslow updated TIKA-2054: Attachment: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf Here is a file which demonstrates this

[jira] [Comment Edited] (TIKA-2054) Problem with ligatures converting from PDF to HTML with Tika

2016-08-12 Thread Angela Onslow (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418723#comment-15418723 ] Angela Onslow edited comment on TIKA-2054 at 8/12/16 11:48 AM: --- Here is a

[jira] [Created] (TIKA-2054) Problem with ligatures converting from PDF to HTML with Tika

2016-08-12 Thread Angela Onslow (JIRA)
Angela Onslow created TIKA-2054: --- Summary: Problem with ligatures converting from PDF to HTML with Tika Key: TIKA-2054 URL: https://issues.apache.org/jira/browse/TIKA-2054 Project: Tika Issue

RE: Tika 1.14?

2016-08-12 Thread Allison, Timothy B.
>> I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes were under consideration. Has any progress been made on this? > I think we're still trying to come up with a plan for how to allow multiple parsers to report text for

Re: Tika 1.14?

2016-08-12 Thread Ray Gauss
I believe we've also still got the issue of structured metadata outstanding. Regards, Ray > On Aug 12, 2016, at 6:27 AM, Nick Burch wrote: > On Thu, 11 Aug 2016, Bob Paulin wrote: >> I know it's been a little bit since we talked about 2.0. We had discussed holding

Re: Tika 1.14?

2016-08-12 Thread Nick Burch
On Thu, 11 Aug 2016, Bob Paulin wrote: > I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes were under consideration. Has any progress been made on this? I think we're still trying to come up with a plan for how to allow multiple

[jira] [Commented] (TIKA-2053) Adding TagRatio to Tika Parser

2016-08-12 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418450#comment-15418450 ] ASF GitHub Bot commented on TIKA-2053: -- GitHub user AravindRam opened a pull request:

[GitHub] tika pull request #130: fix for TIKA-2053 contributed by AravindRam

2016-08-12 Thread AravindRam
GitHub user AravindRam opened a pull request: https://github.com/apache/tika/pull/130 fix for TIKA-2053 contributed by AravindRam Adding TagRatio parser to Tika Parser.

Re: Tika 1.14?

2016-08-12 Thread Bob Paulin
I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes were under consideration. Has any progress been made on this? The community has been really good about dual maintaining, but how much longer do we want to have this expectation?

[jira] [Created] (TIKA-2053) Adding TagRatio to Tika Parser

2016-08-12 Thread Aravind Ram Nathan (JIRA)
Aravind Ram Nathan created TIKA-2053: Summary: Adding TagRatio to Tika Parser Key: TIKA-2053 URL: https://issues.apache.org/jira/browse/TIKA-2053 Project: Tika Issue Type: New Feature