This is a source code search engine, and some of the users here are like
some of the humans over there :) So yes, XML files and source code files
are human readable by my definition. But I think I get your point: rather
than detecting whether a file is binary, decide which MIME types to allow
and use Tika to get the MIME type of each file at runtime while traversing
the file system for indexing.
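
A rough sketch of the allow/skip decision as I understand it, not a
finished implementation. To keep it runnable without Tika on the
classpath, the parent lookup is passed in as a function, and
SAMPLE_PARENTS is a made-up table for illustration only. The class name
MimeFilter, the method isTextLike and the extraAllowed set are all
hypothetical names. With Tika, the lookup would presumably be
MediaTypeRegistry.getSupertype(MediaType), and the text/ test would be
getType().equals("text"), as suggested below.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

public class MimeFilter {

    // Made-up parent table, for illustration only. It does not claim to
    // mirror Tika's real registry; a missing key means "no known parent".
    static final Map<String, String> SAMPLE_PARENTS = Map.of(
            "text/x-java-source", "text/plain",
            "application/xhtml+xml", "application/xml",
            "image/png", "application/octet-stream");

    // Returns true if the type itself, or any supertype reachable through
    // parentOf, starts with "text/" or is listed in extraAllowed.
    // parentOf must return null when a type has no parent.
    static boolean isTextLike(String mimeType,
                              UnaryOperator<String> parentOf,
                              Set<String> extraAllowed) {
        String current = mimeType;
        // Bound the walk so a cyclic parent table cannot loop forever.
        for (int depth = 0; current != null && depth < 32; depth++) {
            if (current.startsWith("text/") || extraAllowed.contains(current)) {
                return true;
            }
            current = parentOf.apply(current);
        }
        return false;
    }

    public static void main(String[] args) {
        // Opting in to XML-based formats is a policy choice, not a given.
        Set<String> extraAllowed = Set.of("application/xml");
        String[] samples = {"text/x-java-source", "application/xhtml+xml", "image/png"};
        for (String type : samples) {
            boolean index = isTextLike(type, SAMPLE_PARENTS::get, extraAllowed);
            System.out.println(type + " -> " + (index ? "index" : "skip"));
        }
    }
}
```

The extraAllowed set is where the "try a few formats and see" step would
end up: types whose parent is application/xml are only indexed if I
decide to list that parent explicitly.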

Thank you



On Mon, Jan 15, 2018 at 2:25 AM Nick Burch <[email protected]> wrote:

> On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> > I am not an expert on mime types and how they extend.  My definition of
> > binary is any file that is not in human readable form. Any other file,
> > I'd like to index. Would that answer your question?
>
> Some of us humans here can read a wider range of formats than others,
> especially if we go slowly... ;)
>
> For now, I'd suggest you start with:
>   * Does the mimetype start with text/ ?
>   * If not, check all parents (supertypes) to see if any of those start
>     with text/
>
> Then:
>   * Try a few formats with a parent of application/xml, and see if you want
>     to include or exclude those (are they human readable enough?)
>   * Try a few formats with a parent of text/xml or text/html, and see if
>     you want to include or exclude them (ditto on really human readable)
>
> Use
> https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
> to get the parent types
>
> Use
> http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
> to check if a mimetype is text/ or not (check for getType().equals("text"))
>
> Nick
