One more thing, regarding application/xml vs. text/xml:
I think I'll skip application/xml for now and just include text/xml.

I'm assuming application/xml covers compressed XML, such as OpenOffice
documents, and text/xml covers uncompressed XML.
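
For reference, here is a minimal sketch of the check suggested earlier in
this thread (accept a type if it, or any of its supertypes, starts with
text/). To keep it runnable without Tika on the classpath, the supertype
lookup is passed in as a function. With Tika it would presumably be backed
by MediaTypeRegistry.getSupertype(MediaType). The class name, the method
name, and the small parent table are made up for illustration and are not
Tika's real registry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class TextTypeFilter {

    /**
     * Returns true if mimeType, or any of its supertypes, is a text/ type.
     * supertypeOf must return null when a type has no known parent.
     */
    public static boolean isTextual(String mimeType, Function<String, String> supertypeOf) {
        String current = mimeType;
        // The depth limit guards against an accidental cycle in the lookup.
        for (int depth = 0; current != null && depth < 32; depth++) {
            if (current.startsWith("text/")) {
                return true;
            }
            current = supertypeOf.apply(current);
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical parent table, for illustration only.
        Map<String, String> parents = new HashMap<>();
        parents.put("application/x-sh", "text/plain");
        parents.put("image/png", "application/octet-stream");

        System.out.println(isTextual("text/x-csrc", parents::get));      // true
        System.out.println(isTextual("application/x-sh", parents::get)); // true
        System.out.println(isTextual("image/png", parents::get));        // false
    }
}
```

Types I decide to skip on purpose could be handled with an explicit
exclusion check before the walk.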


On Fri, Jan 19, 2018 at 10:23 AM Kudrettin Güleryüz <[email protected]>
wrote:

> This is a source code search engine, and some of the users here are like some
> humans over there :) So yes, XML files and source code files are human
> readable in my definition. But I think I get your point: rather than
> detecting binary or not, decide which mime types to allow and use Tika to
> get the mime type of files at runtime when traversing the file system for
> indexing.
>
> Thank you
>
>
>
> On Mon, Jan 15, 2018 at 2:25 AM Nick Burch <[email protected]> wrote:
>
>> On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
>> > I am not an expert on mime types and how they extend.  My definition of
>> > binary is any file that is not in human readable form. Any other file,
>> > I'd like to index. Would that answer your question?
>>
>> Some of us humans here can read a wider range of formats than others,
>> especially if we go slowly... ;)
>>
>> For now, I'd suggest you start with:
>>   * Does the mimetype start with text/ ?
>>   * If not, check all parents (supertypes) to see if any of those start
>>     with text/
>>
>> Then:
>>   * Try a few formats with a parent of application/xml, and see if you
>>     want to include or exclude those (are they human readable enough?)
>>   * Try a few formats with a parent of text/xml or text/html, and see if
>>     you want to include or exclude them (ditto on really human readable)
>>
>> Use
>> https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
>> to get the parent types
>>
>> Use
>> http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
>> to check if a mimetype is text/ or not (check for
>> getType().equals("text"))
>>
>> Nick
>
>
