I can successfully detect valid HTML files in other encodings, but when a valid
file is encoded as UTF-16 it is identified as text/plain. I can see that in
tika-mimetypes.xml the UTF-16 BOMs are used to identify files as text/plain
with a priority of 20, while *.html identification is set to a priority of 40.
I'm not sure why this is the case.

I see the advice here is not to alter tika-mimetypes.xml (and indeed that
would be a pain to maintain) and to use custom-mimetypes.xml for new file
types. However, I want to override the definition of the existing text/plain
type, either reducing its priority or removing the UTF-16 magic entries, so
that my valid UTF-16 HTML files are correctly identified.

Is this possible, or is there a better way to achieve my aim of correctly
identifying my UTF-16 HTML files, as I can with those in other encodings?
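
For illustration, a fallback pre-check outside Tika could look roughly like
the sketch below. It is not Tika's API and does not touch the mimetypes
files: the class and method names (Utf16HtmlSniffer, utf16CharsetFromBom,
looksLikeUtf16Html) are hypothetical, and it assumes the file starts with a
UTF-16 BOM and that the markup appears within the first 2 KB or so. It uses
only the Java standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Tika: sniffs UTF-16 HTML by BOM + decoded prefix.
final class Utf16HtmlSniffer {

    // Number of bytes after the BOM to decode and inspect (an arbitrary assumption).
    private static final int PREFIX_BYTES = 2048;

    // Returns the UTF-16 charset indicated by a leading BOM, or null if there is none.
    // Note: FF FE 00 00 is also the UTF-32LE BOM; this sketch does not distinguish it.
    static Charset utf16CharsetFromBom(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        return null;
    }

    // True if the data has a UTF-16 BOM and the decoded prefix contains HTML markers.
    static boolean looksLikeUtf16Html(byte[] data) {
        Charset charset = utf16CharsetFromBom(data);
        if (charset == null) {
            return false;
        }
        int length = Math.min(data.length - 2, PREFIX_BYTES);
        // Skip the two BOM bytes, then decode only the prefix.
        String head = new String(data, 2, length, charset).toLowerCase(Locale.ROOT);
        return head.contains("<html") || head.contains("<!doctype html");
    }

    public static void main(String[] args) {
        // "\uFEFF" encoded as UTF-16LE yields the bytes FF FE, i.e. the BOM.
        byte[] sample = "\uFEFF<html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_16LE);
        System.out.println(looksLikeUtf16Html(sample));
    }
}
```

Files for which looksLikeUtf16Html returns true could then be treated as
text/html regardless of what the BOM-based text/plain match reports.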

George