On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
I am not an expert on mime types and how they extend. My definition of binary is any file that is not in human readable form. Any other file, I'd like to index. Would that answer your question?

Some of us humans here can read a wider range of formats than others, especially if we go slowly... ;)

For now, I'd suggest you start with:
 * Does the mimetype start with text/ ?
 * If not, check all parents (supertypes) to see if any of those start
   with text/

Then:
 * Try a few formats with a parent of application/xml, and see if you want
   to include or exclude those (are they human readable enough?)
 * Try a few formats with a parent of text/xml or text/html, and see if
   you want to include or exclude them (ditto on really human readable)

Use 
https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
to get the parent types

Use 
http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
to check if a mimetype is text/ or not (check for getType().equals("text"))
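
As a rough sketch of the first two checks above (not tested against Tika itself): walk from the type up through its parents and stop as soon as one of them starts with text/. To keep it runnable without Tika on the classpath, the parent lookup is passed in as a function and types are plain lower-case strings. The class name TextTypeCheck, the method isTextLike and the sample parent map are all made up for illustration. With Tika you would back the lookup with MediaTypeRegistry.getSupertype(), which I'd expect to return null once there is no further parent.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Tika.
class TextTypeCheck {

    // Returns true if the type itself, or any supertype reachable through
    // the supplied lookup, starts with "text/". The lookup is expected to
    // return null when a type has no parent.
    static boolean isTextLike(String mimeType, UnaryOperator<String> supertypeOf) {
        Set<String> seen = new HashSet<>(); // guards against a cyclic lookup
        String current = mimeType;
        while (current != null && seen.add(current)) {
            if (current.startsWith("text/")) {
                return true;
            }
            current = supertypeOf.apply(current);
        }
        return false;
    }

    public static void main(String[] args) {
        // Hand-written sample hierarchy for illustration only; the real
        // parents come from Tika's registry and may differ.
        Map<String, String> parents = new HashMap<>();
        parents.put("image/svg+xml", "application/xml");
        parents.put("application/xml", "text/plain");
        parents.put("image/png", "application/octet-stream");

        System.out.println(isTextLike("text/html", parents::get));     // true
        System.out.println(isTextLike("image/svg+xml", parents::get)); // true
        System.out.println(isTextLike("image/png", parents::get));     // false
    }
}
```

Whether something like the image/svg+xml case should really count as "human readable" is exactly the judgment call from the second list above, so you may want an exclude list on top of this.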

Nick
