I want to be able to detect when a file is HTML even when it is UTF-16 encoded. I can see from the default tika-mimetypes.xml that files with a BOM are normally detected as text/plain, and that is what I observe. I have tried creating my own versions of the HTML and text mime types in a custom-mimetypes.xml. These successfully override the original definitions, but changing their priority does not force the UTF-16 files to be identified as HTML. Even removing the BOM matches completely from the text mime type in the custom-mimetypes.xml does not work.
So I tried another approach: removing the BOM from the InputStream before detecting. However, the UTF-16 file is still not recognised as HTML, despite the text containing multiple matches. It seems that the detect method does not know which encoding the file uses. Is there a way to tell a detector what encoding a file is in, to aid detection?

Thanks,
George
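
To make the second approach concrete, here is a minimal sketch of one possible workaround: rather than only stripping the BOM, use it to pick the charset and transcode the content to UTF-8 before detection. The idea is that magic patterns written as plain byte strings (such as `<html`) would then have a chance to match. The class and method names (`Utf16Normalizer`, `charsetFromBom`, `toUtf8`, `toUtf8Stream`) are hypothetical and use only the JDK; the assumption is that the resulting stream is what gets passed to the detector. Whether this is the recommended way to do it in Tika is exactly what I am unsure about.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: transcodes BOM-marked UTF-16 to UTF-8.
final class Utf16Normalizer {
    private Utf16Normalizer() {}

    /** Returns the charset implied by a UTF-16 BOM, or null if none is present. */
    static Charset charsetFromBom(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            // Caveat: FF FE 00 00 is the UTF-32LE BOM; this sketch does not handle it.
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Drops the BOM and re-encodes as UTF-8; input without a UTF-16 BOM is returned unchanged. */
    static byte[] toUtf8(byte[] data) {
        Charset cs = charsetFromBom(data);
        if (cs == null) return data;
        String text = new String(data, 2, data.length - 2, cs);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Buffers the whole stream in memory (Java 9+), so only suitable for modest file sizes. */
    static InputStream toUtf8Stream(InputStream in) throws IOException {
        return new ByteArrayInputStream(toUtf8(in.readAllBytes()));
    }

    public static void main(String[] args) throws IOException {
        // "\uFEFF" encoded as UTF-16LE yields the bytes FF FE, i.e. a little-endian BOM.
        byte[] utf16 = "\uFEFF<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_16LE);
        try (InputStream normalized = toUtf8Stream(new ByteArrayInputStream(utf16))) {
            // The normalized stream is what would be handed to the detector.
            System.out.println(new String(normalized.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

This reads the whole file into memory, which is wasteful if the detector only inspects the first few kilobytes; transcoding just a bounded prefix would be a cheaper variant of the same idea.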
