This is a source code search engine, so some of the users here are like some of the humans over there :) So yes, XML files and source code files are human readable by my definition. But I think I get your point: rather than detecting whether a file is binary, decide which MIME types to allow, and use Tika to get each file's MIME type at runtime while traversing the file system for indexing.
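
A minimal Java sketch of that check, following the text/ and supertype steps in the quoted message below. It is an illustration under assumptions, not a tested implementation: the class TextLikeFilter, the SUPERTYPES table and all of its entries are hypothetical stand-ins so the example runs without Tika on the classpath. In a real indexer the parent lookup would presumably come from Tika's MediaTypeRegistry.getSupertype(MediaType) rather than a hand-written map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Decides whether a MIME type is "text-like" by walking up its supertypes. */
class TextLikeFilter {
    // Hypothetical child -> parent entries, for illustration only.
    // With Tika, this lookup would be replaced by the registry's supertype call.
    private static final Map<String, String> SUPERTYPES = new HashMap<>();
    static {
        SUPERTYPES.put("application/x-sh", "text/plain");
        SUPERTYPES.put("application/rss+xml", "application/xml");
    }

    /**
     * Returns true if the type, or any of its ancestors, starts with "text/"
     * or is listed in extraAllowed (e.g. "application/xml" once you have
     * decided that XML-derived formats are readable enough to index).
     */
    static boolean isTextLike(String mimeType, Set<String> extraAllowed) {
        Set<String> seen = new HashSet<>(); // guards against cycles in the table
        String current = mimeType;
        while (current != null && seen.add(current)) {
            if (current.startsWith("text/") || extraAllowed.contains(current)) {
                return true;
            }
            current = SUPERTYPES.get(current); // null once there is no parent
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> none = Collections.emptySet();
        Set<String> allowXml = Collections.singleton("application/xml");
        System.out.println(isTextLike("text/x-java-source", none));
        System.out.println(isTextLike("application/rss+xml", none));
        System.out.println(isTextLike("application/rss+xml", allowXml));
    }
}
```

The extraAllowed set is where the include/exclude decision for application/xml children would live: pass an empty set to index only types that descend from text/, or add "application/xml" to admit XML-derived formats as well.
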
Thank you

On Mon, Jan 15, 2018 at 2:25 AM Nick Burch <[email protected]> wrote:

> On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> > I am not an expert on mime types and how they extend. My definition of
> > binary is any file that is not in human readable form. Any other file,
> > I'd like to index. Would that answer your question?
>
> Some of us humans here can read a wider range of formats than others,
> especially if we go slowly... ;)
>
> For now, I'd suggest you start with:
> * Does the mimetype start with text/ ?
> * If not, check all parents (supertypes) to see if any of those start
>   with text/
>
> Then:
> * Try a few formats with a parent of application/xml, and see if you want
>   to include or exclude those (are they human readable enough?)
> * Try a few formats with a parent of text/xml or text/html, and see if
>   you want to include or exclude them (ditto on really human readable)
>
> Use
> https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
> to get the parent types
>
> Use
> http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
> to check if a mimetype is text/ or not (check for getType().equals("text"))
>
> Nick
