Detecting an HTML file which is UTF-16 encoded

george Tue, 17 Jun 2014 01:04:27 -0700

I want to be able to detect that a file is HTML even when it is UTF-16
encoded. I can see from the default tika-mimetypes.xml that files starting
with a BOM are normally detected as text/plain, and that is what happens
here. I have tried creating my own versions of the html and text mime types
in a custom-mimetypes.xml. These successfully override the original ones,
but changing their priority does not force the UTF-16 files to be
identified as HTML. Even removing the BOM matches completely from the text
mime type in custom-mimetypes.xml does not work.


So I tried another approach: removing the BOM from the input stream before
detecting. However, the UTF-16 file is still not recognised as HTML, despite
the text containing multiple matches. It seems that the detect method does
not know what encoding the file uses. Is there a way to tell a detector what
encoding a file is in to aid detection?
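
For illustration: in UTF-16 every ASCII character occupies two bytes, so a
byte pattern such as "<html" never appears contiguously in the raw stream,
with or without the BOM. Below is a minimal sketch, using only the Java
standard library, of transcoding a BOM-marked UTF-16 prefix to UTF-8 before
it is handed to a detector. The class Utf16Prefix and the method
toUtf8Prefix are hypothetical names and not part of Tika, and whether this
fits a particular detection setup is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf16Prefix {

    // Hypothetical helper (not a Tika API): if the buffer starts with a
    // UTF-16 BOM, decode the rest as UTF-16 and re-encode it as UTF-8
    // without a BOM. Any other input is returned unchanged.
    // Note: a UTF-32LE BOM (FF FE 00 00) is not handled specially here.
    static byte[] toUtf8Prefix(byte[] prefix) {
        if (prefix.length < 2) {
            return prefix;
        }
        int b0 = prefix[0] & 0xFF;
        int b1 = prefix[1] & 0xFF;
        Charset charset;
        if (b0 == 0xFE && b1 == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
        } else if (b0 == 0xFF && b1 == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
        } else {
            return prefix;
        }
        // Skip the 2-byte BOM and drop a trailing odd byte, which would be
        // half of a code unit cut off by the buffer boundary.
        int length = (prefix.length - 2) & ~1;
        String text = new String(prefix, 2, length, charset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+FEFF encoded as UTF-16LE yields the bytes FF FE, i.e. the BOM.
        byte[] raw = "\uFEFF<html><body>".getBytes(StandardCharsets.UTF_16LE);
        byte[] utf8 = toUtf8Prefix(raw);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

The resulting bytes could then be wrapped in a ByteArrayInputStream and
passed to whatever detection call is already in use, so that byte-oriented
magic patterns see the markup as contiguous ASCII.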

Thanks

George
