[jira] [Created] (TIKA-2596) Make PDF2XHTML and AbstractPDF2XHTML public classes

2018-03-01 Thread Kyle Dent (JIRA)
Kyle Dent created TIKA-2596: --- Summary: Make PDF2XHTML and AbstractPDF2XHTML public classes Key: TIKA-2596 URL: https://issues.apache.org/jira/browse/TIKA-2596 Project: Tika Issue Type: Improvement

Re: Tika 1.18?

2018-03-01 Thread Luís Filipe Nassif
I think we should work around TIKA-2591, and I would like to work on TIKA-1466 (what do you think?) and fix TIKA-2568. Cheers, Luis

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382346#comment-16382346 ] Tim Allison commented on TIKA-2592: --- Kicked off grep an hour ago. :) I'm not sure we should make a rule

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-01 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382330#comment-16382330 ] Ken Krugler commented on TIKA-2592: --- Before making this kind of change (default "unicode" to UTF-8), 
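
To illustrate why the default chosen for a "unicode" label matters, here is a minimal sketch using only the JDK. It is not Tika's detection code; the class name `CharsetLabelDemo` and the sample markup are hypothetical. It decodes the same UTF-8 bytes once as UTF-8 and once as UTF-16, which is the mix-up the ticket title describes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
final class CharsetLabelDemo {

    // Decode the bytes as UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode the same bytes as UTF-16 (Java assumes big-endian when no BOM is present).
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Sample markup containing one non-ASCII character, encoded as UTF-8.
        byte[] page = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);

        System.out.println("As UTF-8:  " + decodeUtf8(page));
        System.out.println("As UTF-16: " + decodeUtf16(page));
    }
}
```

Decoding as UTF-16 pairs up the bytes into unrelated code units, so the markup is no longer recognizable; this is why the comment asks for evidence before changing the default.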

[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382220#comment-16382220 ] Md edited comment on TIKA-2593 at 3/1/18 4:51 PM: -- I would like to do a few things: *

[jira] [Commented] (TIKA-2595) If source build creates 6 jars, why aren't all 6 binaries available for download?

2018-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382272#comment-16382272 ] Nick Burch commented on TIKA-2595: -- Binaries of all Apache Tika jars are available from Maven Central. The

Re: Tika 1.18?

2018-03-01 Thread Chris Mattmann
Same: makes perfect sense to me, and let's do it. (I just (finally) updated Tika Python downstream to be based on Tika 1.16; I guess I should get it based on 1.17 soon too: https://github.com/chrismattmann/tika-python/blob/master/tika/__init__.py#L17 ) Cheers, Chris On 3/1/18, 5:16 AM,

[jira] [Created] (TIKA-2595) If source build creates 6 jars, why aren't all 6 binaries available for download?

2018-03-01 Thread Andrew Pavlin (JIRA)
Andrew Pavlin created TIKA-2595: --- Summary: If source build creates 6 jars, why aren't all 6 binaries available for download? Key: TIKA-2595 URL: https://issues.apache.org/jira/browse/TIKA-2595 Project:

Re: Unnecessary WARNING Logging?

2018-03-01 Thread lewis john mcgibbney
Thank you Nick, I was not aware of the config feature. Best, Lewis On Thu, Mar 1, 2018 at 12:17 AM, wrote: > From: Nick Burch > To: dev@tika.apache.org > Date: Wed, 28 Feb 2018 08:03:53 +0000 (GMT) > Subject: Re:

[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382220#comment-16382220 ] Md edited comment on TIKA-2593 at 3/1/18 4:11 PM: -- I would like to do a few things: *

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382220#comment-16382220 ] Md commented on TIKA-2593: -- I would like to do a few things: * exclude comments * possibly exclude header and

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382207#comment-16382207 ] Tim Allison commented on TIKA-2593: --- bq. Which shapes are being extracted, are you able to share an

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382197#comment-16382197 ] Tim Allison commented on TIKA-2592: --- bq. What I know is: we can't rely on charset encoding in meta-tags

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382193#comment-16382193 ] Md commented on TIKA-2593: -- I am talking about this ticket and for example you can see the attached file in the

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382177#comment-16382177 ] Tim Allison commented on TIKA-2593: --- bq. I wanted to exclude shape based content by

[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382167#comment-16382167 ] Md edited comment on TIKA-2593 at 3/1/18 3:33 PM: -- No, deleted content is not showing if we

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382167#comment-16382167 ] Md commented on TIKA-2593: -- No, deleted content is not showing if we do

[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382125#comment-16382125 ] Tim Allison edited comment on TIKA-2593 at 3/1/18 3:09 PM: --- bq. I think I did

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382125#comment-16382125 ] Tim Allison commented on TIKA-2593: --- bq. I think I did figure it out. I need to set

[jira] [Commented] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr

2018-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382088#comment-16382088 ] Nick Burch commented on TIKA-2591: -- If you detect with mime magic only, we should correctly spot this as a
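
As a rough illustration of what a magic-only check involves for the file type named in the ticket title, the sketch below tests a header against the two standard TIFF signatures ("MM" followed by 42 for big-endian, "II" followed by 42 for little-endian). It is a simplified assumption-laden example, not Tika's mime-magic implementation; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper; Tika's real detector is driven by its mime-type definitions.
final class TiffMagicSketch {

    // Big-endian TIFF signature: 'M' 'M' 0x00 0x2A.
    private static final byte[] BIG_ENDIAN = {0x4D, 0x4D, 0x00, 0x2A};
    // Little-endian TIFF signature: 'I' 'I' 0x2A 0x00.
    private static final byte[] LITTLE_ENDIAN = {0x49, 0x49, 0x2A, 0x00};

    // Returns true when the first four bytes match either TIFF signature.
    static boolean looksLikeTiff(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(header, 4);
        return Arrays.equals(prefix, BIG_ENDIAN) || Arrays.equals(prefix, LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        byte[] bigEndianHeader = {0x4D, 0x4D, 0x00, 0x2A, 0x00, 0x00, 0x00, 0x08};
        System.out.println(looksLikeTiff(bigEndianHeader));
    }
}
```

A check like this looks only at the leading bytes, so it is unaffected by whatever heuristics might otherwise classify the same file as a tar archive.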

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382014#comment-16382014 ] Md commented on TIKA-2593: -- I think I did figure it out. I need to set 

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382005#comment-16382005 ] Md commented on TIKA-2593: -- I notice it works nicely when I am asking to exclude header and footer by 

Re: Tika 1.18?

2018-03-01 Thread Nick Burch
On Thu, 1 Mar 2018, Allison, Timothy B. wrote: There have been some important bug fixes, a few new capabilities, and the upgrading of dependencies because of CVEs. There are a bunch of mime tickets from Andreas Meier that I’d like to get into 1.18. Is there anything else that is critical?

[jira] [Updated] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Md updated TIKA-2593: - Description: I am using the following code to extract text from a docx file: {code:java} AutoDetectParser parser = new

Tika 1.18?

2018-03-01 Thread Allison, Timothy B.
All, There have been some important bug fixes, a few new capabilities, and the upgrading of dependencies because of CVEs. There are a bunch of mime tickets from Andreas Meier that I’d like to get into 1.18. Is there anything else that is critical? Schedule-wise, I propose getting changes in

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-01 Thread Andreas Meier (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381667#comment-16381667 ] Andreas Meier commented on TIKA-2592: - Thanks for your response [~kkrugler] You are right, "unicode"