[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384242#comment-16384242 ] Ken Krugler commented on TIKA-2592: --- [~AndreasMeier] - I assume when you said: {quote}I don't think we

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Attachment: IANA Charset names.txt > HTML with charset unicode handled as utf-16 instead utf-8 >

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Priority: Minor (was: Major) > HTML with charset unicode handled as utf-16 instead utf-8 >

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Issue Type: Improvement (was: Bug) > HTML with charset unicode handled as utf-16 instead utf-8 >

Re: Tika 1.18?

2018-03-02 Thread Nick Burch
On Fri, 2 Mar 2018, Luís Filipe Nassif wrote: If I make no progress on TIKA-1466 until 3/9, you can start the release process without it. But do you devs agree with the proposed change: allow overriding of glob patterns in custom-mimetypes.xml? What happens if you have two different custom

[jira] [Comment Edited] (TIKA-2569) Grouped Text boxes in .ppt

2018-03-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384155#comment-16384155 ] Tim Allison edited comment on TIKA-2569 at 3/2/18 9:10 PM: --- [~BAEApache], if all

[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt

2018-03-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384155#comment-16384155 ] Tim Allison commented on TIKA-2569: --- [~BAEApache], if all goes according to plan, we'll start the release

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384081#comment-16384081 ] Tim Allison commented on TIKA-2592: --- bq. Do you have a testcorpus or are you crawling the web Tim

[jira] [Commented] (TIKA-2597) Attachment Extraction Case Sensitivity

2018-03-02 Thread Todd Dixon (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384065#comment-16384065 ] Todd Dixon commented on TIKA-2597: -- From what I read on the FILE_FLAG_POSIX_SEMANTICS flag that will only

[jira] [Commented] (TIKA-2597) Attachment Extraction Case Sensitivity

2018-03-02 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383880#comment-16383880 ] Nick Burch commented on TIKA-2597: -- Trying to fully re-implement the Windows case-insensitivity rules

[jira] [Created] (TIKA-2597) Attachment Extraction Case Sensitivity

2018-03-02 Thread Todd Dixon (JIRA)
Todd Dixon created TIKA-2597: Summary: Attachment Extraction Case Sensitivity Key: TIKA-2597 URL: https://issues.apache.org/jira/browse/TIKA-2597 Project: Tika Issue Type: Bug

RE: Tika 1.18?

2018-03-02 Thread Allison, Timothy B.
> But do you devs agree with the proposed change: allow overriding of glob patterns in custom-mimetypes.xml? +1 from me From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] Sent: Friday, March 2, 2018 8:21 AM To: Allison, Timothy B. Cc: dev@tika.apache.org Subject: Re:

[jira] [Commented] (TIKA-2568) Full encrypted 7Z file not detected as such

2018-03-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383556#comment-16383556 ] Tim Allison commented on TIKA-2568: --- Just added you to the PMC group on JIRA. Sorry for our delay! >

[jira] [Assigned] (TIKA-2568) Full encrypted 7Z file not detected as such

2018-03-02 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-2568: - Assignee: Luis Filipe Nassif > Full encrypted 7Z file not detected as such >

RE: Tika 1.18?

2018-03-02 Thread Allison, Timothy B.
TIKA-2591 and TIKA-2568 +1 TIKA-1466 -- how long will it take, do you think? This seems potentially non-trivial... -Original Message- From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] Sent: Thursday, March 1, 2018 5:41 PM To: dev@tika.apache.org Subject: Re: Tika 1.18? I think we

[jira] [Comment Edited] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Andreas Meier (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383350#comment-16383350 ] Andreas Meier edited comment on TIKA-2592 at 3/2/18 10:56 AM: -- {quote}Before

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Andreas Meier (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Meier updated TIKA-2592: Attachment: TestHTMLCharsetCP1256.html TestHTMLCharsetArabicCP1256.html > HTML with

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Andreas Meier (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383350#comment-16383350 ] Andreas Meier commented on TIKA-2592: - {quote} Before making this kind of change (default "unicode" to