[jira] [Resolved] (TIKA-431) Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

Jukka Zitting (JIRA) Sun, 08 Jul 2012 15:47:36 -0700

     [ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-431.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2
         Assignee: Jukka Zitting  (was: Ken Krugler)

In revision 1358858 I made the text and HTML parsers return character encoding 
information in the charset parameter of the returned content type. The content 
encoding field is still present for backwards compatibility, but I added a note 
to CHANGES.txt mentioning that it should be considered deprecated.
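
For callers that previously read the content encoding field, a minimal sketch
of pulling the charset parameter out of a content type string may be useful.
It assumes the caller already holds the content type value as a plain string
(for example "text/html; charset=ISO-8859-1"). The class and method names are
hypothetical and not part of Tika, and the parsing is deliberately naive: it
does not handle quoted values that themselves contain semicolons.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Tika: reads the charset parameter
// from a Content-Type value such as "text/html; charset=iso-8859-1".
public class CharsetParam {

    /** Returns the charset parameter of a Content-Type value, if present. */
    public static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value pair
            }
            // Parameter names are compared case-insensitively.
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=iso-8859-1").orElse("(none)"));
        System.out.println(charsetOf("text/plain").orElse("(none)"));
    }
}
```

Returning an Optional keeps "no charset declared" distinct from an empty
string, so the caller can decide whether to fall back to detection.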
                
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-431
>                 URL: https://issues.apache.org/jira/browse/TIKA-431
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>            Reporter: Erik Hetzner
>            Assignee: Jukka Zitting
>             Fix For: 1.2
>
>         Attachments: TIKA-431.patch
>
>
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> Content-Encoding is not for the charset. It is for values like gzip, deflate, 
> compress, or identity.
> Charset is passed in with the Content-Type. For instance:
> text/html; charset=iso-8859-1
> Tika should, in my opinion, do the following:
> 1. Stop using Content-Encoding, unless it wants me to be able to pass in 
> gzipped content in an input stream.
> 2. Parse and understand charset=... declarations if passed in the Metadata 
> object.
> 3. Return charset=... declarations in the Metadata object if a charset is 
> detected.
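
The split of roles described in the report can be sketched as follows. This is
an illustration only, not Tika code: the class and method names are
hypothetical, and the set of content codings is limited to the four values
named in the report.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: keeps the two headers' roles apart.
public class HeaderRoles {

    // Content codings named in the report; a real registry has more entries.
    private static final Set<String> CONTENT_CODINGS =
            new HashSet<>(Arrays.asList("gzip", "deflate", "compress", "identity"));

    /** True if the value is a content coding, i.e. valid for Content-Encoding. */
    public static boolean isContentCoding(String value) {
        return value != null
                && CONTENT_CODINGS.contains(value.trim().toLowerCase(Locale.ROOT));
    }

    /** Appends a detected charset to a media type as a Content-Type parameter. */
    public static String withCharset(String mediaType, String charset) {
        if (charset == null || charset.isEmpty()) {
            return mediaType; // nothing detected, leave the type as-is
        }
        return mediaType + "; charset=" + charset;
    }

    public static void main(String[] args) {
        // A charset name is not a content coding.
        System.out.println(isContentCoding("iso-8859-1"));
        System.out.println(isContentCoding("gzip"));
        // A detected charset belongs in the Content-Type value.
        System.out.println(withCharset("text/html", "iso-8859-1"));
    }
}
```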

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
