One more thing, regarding application/xml vs. text/xml: I think I'll skip application/xml for now and just include text/xml, assuming application/xml is compressed XML such as OpenOffice documents and text/xml is uncompressed XML.

On Fri, Jan 19, 2018 at 10:23 AM Kudrettin Güleryüz <[email protected]> wrote:

> This is a source code search engine, some of the users here are like some
> humans over there :) So, yes, XML files and source code files are human
> readable in my definition. But I think I get your point: rather than
> detecting binary or not, decide which mime types to allow and use Tika to
> get the mime type of files at runtime when traversing the file system for
> indexing.
>
> Thank you
>
> On Mon, Jan 15, 2018 at 2:25 AM Nick Burch <[email protected]> wrote:
>
>> On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
>> > I am not an expert on mime types and how they extend. My definition of
>> > binary is any file that is not in human readable form. Any other file,
>> > I'd like to index. Would that answer your question?
>>
>> Some of us humans here can read a wider range of formats than others,
>> especially if we go slowly... ;)
>>
>> For now, I'd suggest you start with:
>> * Does the mimetype start with text/ ?
>> * If not, check all parents (supertypes) to see if any of those start
>>   with text/
>>
>> Then:
>> * Try a few formats with a parent of application/xml, and see if you
>>   want to include or exclude those (are they human readable enough?)
>> * Try a few formats with a parent of text/xml or text/html, and see if
>>   you want to include or exclude them (ditto on really human readable)
>>
>> Use
>> https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
>> to get the parent types
>>
>> Use
>> http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
>> to check if a mimetype is text/ or not (check for
>> getType().equals("text"))
>>
>> Nick
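
For reference, here is a rough sketch of the check Nick describes: accept a type if it starts with text/, otherwise walk its parents and accept it if any of them does. To keep it runnable without Tika on the classpath, the parent lookup is a small hand-written table. The class name, method names, and the "x-example" types are all made up for illustration and are not Tika's real hierarchy. In an actual indexer, the lookup would presumably go through MediaTypeRegistry.getSupertype(MediaType), and the text test would be getType().equals("text"), as in the links above.

```java
import java.util.HashMap;
import java.util.Map;

final class TextTypeCheck {
    // Hypothetical child -> parent table, for illustration only.
    // With Tika this lookup would come from MediaTypeRegistry.getSupertype(...).
    static final Map<String, String> PARENTS = new HashMap<>();
    static {
        PARENTS.put("application/x-example-feed", "text/xml");
        PARENTS.put("application/x-example-doc", "application/xml");
        PARENTS.put("application/x-example-archive", "application/octet-stream");
    }

    // Step 1: does the mime type itself start with text/ ?
    static boolean isText(String mimeType) {
        return mimeType.startsWith("text/");
    }

    // Step 2: if not, check each parent (supertype) in turn.
    static boolean isTextOrHasTextParent(String mimeType) {
        String current = mimeType;
        // The hop limit guards against an accidental cycle in the table.
        for (int hops = 0; current != null && hops < 32; hops++) {
            if (isText(current)) {
                return true;
            }
            current = PARENTS.get(current); // null once there is no parent
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/xml",
            "application/x-example-feed",
            "application/x-example-doc",
            "application/x-example-archive"
        };
        for (String type : samples) {
            System.out.println(type + " -> index? " + isTextOrHasTextParent(type));
        }
    }
}
```

In this sketch application/xml has no text/ parent in the table, so anything that only descends from it is skipped, which matches the plan of including text/xml and leaving application/xml out for now. Whether that holds with the real registry would need to be checked against Tika itself.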
