[GitHub] [tika] dependabot[bot] opened a new pull request #522: Bump twelvemonkeys.version from 3.8.1 to 3.8.2

2022-03-03 Thread GitBox
dependabot[bot] opened a new pull request #522: URL: https://github.com/apache/tika/pull/522 Bumps `twelvemonkeys.version` from 3.8.1 to 3.8.2. Updates `common-io` from 3.8.1 to 3.8.2 Updates `imageio-bmp` from 3.8.1 to 3.8.2 Updates `imageio-jpeg` from 3.8.1 to 3.8.2

[GitHub] [tika] dependabot[bot] opened a new pull request #521: Bump maven-javadoc-plugin from 3.3.1 to 3.3.2

2022-03-03 Thread GitBox
dependabot[bot] opened a new pull request #521: URL: https://github.com/apache/tika/pull/521 Bumps [maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) from 3.3.1 to 3.3.2. Commits

[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501040#comment-17501040 ] Hudson commented on TIKA-3687: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #475 (See

[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500990#comment-17500990 ] Tim Allison edited comment on TIKA-3687 at 3/3/22, 7:13 PM: Thank you

[jira] [Resolved] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3687. --- Fix Version/s: 2.3.1 Resolution: Fixed Thank you! > Email file detected as text/html >

[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500986#comment-17500986 ] ASF GitHub Bot commented on TIKA-3687: -- tballison merged pull request #520: URL:

[GitHub] [tika] tballison merged pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox
tballison merged pull request #520: URL: https://github.com/apache/tika/pull/520 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:

[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500974#comment-17500974 ] ASF GitHub Bot commented on TIKA-3687: -- SchwingSK commented on a change in pull request #520: URL:

[GitHub] [tika] SchwingSK commented on a change in pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox
SchwingSK commented on a change in pull request #520: URL: https://github.com/apache/tika/pull/520#discussion_r818959865 ## File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml ## @@ -6422,7 +6422,7 @@ - + Review comment:

[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500962#comment-17500962 ] Thierry Guérin edited comment on TIKA-3687 at 3/3/22, 6:22 PM: --- Created a

[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500966#comment-17500966 ] ASF GitHub Bot commented on TIKA-3687: -- tballison commented on a change in pull request #520: URL:

[GitHub] [tika] tballison commented on a change in pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox
tballison commented on a change in pull request #520: URL: https://github.com/apache/tika/pull/520#discussion_r818943762 ## File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml ## @@ -6422,7 +6422,7 @@ - + Review comment:

[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500962#comment-17500962 ] Thierry Guérin commented on TIKA-3687: -- Created a pull request:

[jira] [Updated] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thierry Guérin updated TIKA-3687: - Description: The attached email (which I redacted from a real email received from Office365) is

[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500953#comment-17500953 ] ASF GitHub Bot commented on TIKA-3687: -- SchwingSK opened a new pull request #520: URL:

[GitHub] [tika] SchwingSK opened a new pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox
SchwingSK opened a new pull request #520: URL: https://github.com/apache/tika/pull/520 1024 is maybe a bit overkill for the X|DKIM|ARC headers lookahead ?

[jira] [Created] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira
Thierry Guérin created TIKA-3687: Summary: Email file detected as text/html Key: TIKA-3687 URL: https://issues.apache.org/jira/browse/TIKA-3687 Project: Tika Issue Type: Bug Affects

[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500904#comment-17500904 ] Tim Allison commented on TIKA-3668: --- When I ran the same test against 1.26 with tesseract removed via

[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500874#comment-17500874 ] Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:46 PM: Thank you. I

[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500876#comment-17500876 ] Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:42 PM: I just added the

[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500876#comment-17500876 ] Tim Allison commented on TIKA-3668: --- I just added the two config files I was using. > High CPU

[jira] [Updated] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3668: -- Attachment: tika-config.xml tika-config-no-tess.xml > High CPU utilization in Tika

[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500874#comment-17500874 ] Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:40 PM: Thank you. I

[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500875#comment-17500875 ] Tim Allison commented on TIKA-3668: --- I'm not denying you're seeing what you're seeing. I regret that if

[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500874#comment-17500874 ] Tim Allison commented on TIKA-3668: --- Thank you. I tried three things this morning. 1) Manually

[jira] [Comment Edited] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Vincent Massol (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500814#comment-17500814 ] Vincent Massol edited comment on TIKA-3686 at 3/3/22, 2:57 PM: --- [~nick]

[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Vincent Massol (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500814#comment-17500814 ] Vincent Massol commented on TIKA-3686: -- [~nick] Thanks. Note that it seems to be some sort of

[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500804#comment-17500804 ] Nick Burch commented on TIKA-3686: -- Detecting types of text-based files with magic is always going to

[jira] [Commented] (TIKA-3682) PDFParser is extracting each char of a word in a new line

2022-03-03 Thread Sree Harsha (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500792#comment-17500792 ] Sree Harsha commented on TIKA-3682: --- Apologies for delayed response.. Creating a sample file and will

[jira] [Created] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Marius Dumitru Florea (Jira)
Marius Dumitru Florea created TIKA-3686: --- Summary: CSS file detected as JavaScript (application/javascript) Key: TIKA-3686 URL: https://issues.apache.org/jira/browse/TIKA-3686 Project: Tika

[jira] [Comment Edited] (TIKA-3684) Extract text returns the text multiple times

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500173#comment-17500173 ] Tim Allison edited comment on TIKA-3684 at 3/3/22, 12:42 PM: - I attached an

[jira] [Comment Edited] (TIKA-3684) Extract text returns the text multiple times

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500173#comment-17500173 ] Tim Allison edited comment on TIKA-3684 at 3/3/22, 12:41 PM: - I attached an

[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500720#comment-17500720 ] Tim Allison commented on TIKA-3684: --- Oops. Thank you! > Extract text returns the text multiple times >

[GitHub] [tika] tballison merged pull request #519: Bump commons-net from 3.7.2 to 3.8.0

2022-03-03 Thread GitBox
tballison merged pull request #519: URL: https://github.com/apache/tika/pull/519

[GitHub] [tika] tballison merged pull request #517: Bump bndlib from 1.50.0 to 2.0.0.20130123-133441

2022-03-03 Thread GitBox
tballison merged pull request #517: URL: https://github.com/apache/tika/pull/517

[GitHub] [tika] tballison merged pull request #518: Bump build-helper-maven-plugin from 3.0.0 to 3.3.0

2022-03-03 Thread GitBox
tballison merged pull request #518: URL: https://github.com/apache/tika/pull/518