Dat Pham created JAMES-2581:
-------------------------------
Summary: Configurable ContentType blacklist for Tika
Key: JAMES-2581
URL: https://issues.apache.org/jira/browse/JAMES-2581
Project: James Server
Issue Type: Improvement
Reporter: Dat Pham
Enhanced production logging on failing Tika calls highlights the fact that our
installation of Tika **cannot** handle some kinds of attachments.
Here is a log example:
```
org.apache.http.client.HttpResponseException: Unsupported Media Type
    at org.apache.http.impl.client.AbstractResponseHandler.handleResponse(AbstractResponseHandler.java:70)
    at org.apache.http.client.fluent.Response.handleResponse(Response.java:90)
    at org.apache.http.client.fluent.Response.returnContent(Response.java:97)
    at org.apache.james.mailbox.tika.TikaHttpClientImpl.recursiveMetaDataAsJson(TikaHttpClientImpl.java:62)
    at org.apache.james.mailbox.tika.TikaTextExtractor.performContentExtraction(TikaTextExtractor.java:86)
    at org.apache.james.mailbox.tika.TikaTextExtractor.lambda$extractContent$0(TikaTextExtractor.java:81)
```
(131 matches in the last 2 days)
Here is a list of Content types we repeatedly fail on:
- application/ics
- application/zip
- application/pgp-signature
- image/jpg
- image/jpeg
- image/png
- message/delivery-status
As an admin, I should be able to specify, in the `tika.properties` file, a
comma-separated list of Content types to blacklist.
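To make the request concrete, here is a minimal sketch of how such a setting could be parsed and checked before calling Tika. Everything in it is an assumption for illustration: the property name `tika.contentType.blacklist`, the `ContentTypeBlacklist` class and its methods are hypothetical and are not existing James code. The comparison ignores case and Content-Type parameters (such as `; charset=utf-8`), which is a design choice of this sketch rather than something the issue specifies.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of James: holds the blacklisted Content types.
public class ContentTypeBlacklist {
    // Assumed property name; the issue does not define one.
    static final String PROPERTY = "tika.contentType.blacklist";

    private final Set<String> blacklisted;

    ContentTypeBlacklist(Set<String> blacklisted) {
        this.blacklisted = blacklisted;
    }

    // Reads the comma-separated list; a missing or empty property means "blacklist nothing".
    static ContentTypeBlacklist fromProperties(Properties properties) {
        String raw = properties.getProperty(PROPERTY, "");
        Set<String> types = Arrays.stream(raw.split(","))
            .map(ContentTypeBlacklist::normalize)
            .filter(type -> !type.isEmpty())
            .collect(Collectors.toSet());
        return new ContentTypeBlacklist(types);
    }

    // Drops parameters after ';', trims whitespace and lower-cases the media type.
    static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT);
    }

    // True when the attachment should be skipped instead of being sent to Tika.
    boolean isBlacklisted(String contentType) {
        return contentType != null && blacklisted.contains(normalize(contentType));
    }

    public static void main(String[] args) throws IOException {
        Properties properties = new Properties();
        // Stands in for the content of tika.properties.
        properties.load(new StringReader(
            "tika.contentType.blacklist=application/ics, application/zip, image/png\n"));
        ContentTypeBlacklist blacklist = fromProperties(properties);

        System.out.println(blacklist.isBlacklisted("application/zip"));       // true
        System.out.println(blacklist.isBlacklisted("IMAGE/PNG; name=a.png")); // true
        System.out.println(blacklist.isBlacklisted("text/plain"));            // false
    }
}
```

A text extractor could consult `isBlacklisted` first and return an empty result for blacklisted types, so that neither the HTTP call nor the failure log entry happens.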
Benefits:
- Avoid known-to-be-failing Tika calls (less log output)
- Avoid transmitting potentially big payloads over the network for nothing
(better performance)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)