Re: How to identify binary content ?

Jukka Zitting Thu, 14 Aug 2014 06:05:52 -0700

Hi,

On Sun, Aug 10, 2014 at 1:46 AM, Avi Hayun wrote:
> How do I identify content types which can't be read as text (in notepad for
> example) because they have some binary content in them.


You can use the media type relationship information stored in
Tika's type registry, like this:

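    import org.apache.tika.Tika;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.mime.MediaTypeRegistry;

    Tika tika = new Tika();
    MediaType type = MediaType.parse(tika.detect(...));

    MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
    // isInstanceOf(a, b) is true when a equals b or is a specialization of b,
    // so text/plain itself and its subtypes (e.g. text/html) both match.
    if (registry.isInstanceOf(type, MediaType.TEXT_PLAIN)) {
        // process text
    } else {
        // process binary
    }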
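
To make the idea of a type relationship concrete, here is a toy sketch
of such a check using only the JDK. It is not Tika's implementation:
the PARENT map and the isKindOf method are hypothetical names made up
for this illustration, and the few entries are examples rather than
Tika's actual registry data.

```java
import java.util.HashMap;
import java.util.Map;

public class TypeHierarchyDemo {
    // Hypothetical child -> parent links, for illustration only.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("text/html", "text/plain");
        PARENT.put("text/csv", "text/plain");
        PARENT.put("text/plain", "application/octet-stream");
        PARENT.put("image/png", "application/octet-stream");
    }

    // True if 'type' equals 'ancestor' or reaches it by following parent links.
    static boolean isKindOf(String type, String ancestor) {
        for (String t = type; t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isKindOf("text/html", "text/plain"));
        System.out.println(isKindOf("image/png", "text/plain"));
    }
}
```

The argument order matters in a check like this: the first argument is
the detected type, the second is the more general type it is compared
against.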


> [...] if it finds text-parsable content, I want it to take the content as it is

Note that consuming text data can be surprisingly difficult given all
the different character encodings out there. Tika's parser classes
contain quite a bit of logic for automatically figuring out the
correct character encoding and other details needed for correctly
consuming text data.
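
As a small illustration of why the encoding matters, the following
JDK-only sketch decodes the same bytes with two different charsets.
It does not use Tika at all; the EncodingDemo class and its decode
helper are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Decode raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" (cafe with an acute accent) encoded as UTF-8:
        // the accented letter takes two bytes, 0xC3 0xA9.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(bytes, StandardCharsets.UTF_8);
        String wrong = decode(bytes, StandardCharsets.ISO_8859_1);

        // The correct charset yields 4 characters; the wrong one yields 5,
        // because each of the two bytes becomes its own character.
        System.out.println(right.length());
        System.out.println(wrong.length());
        System.out.println(right.equals(wrong));
    }
}
```

Nothing in the bytes themselves says which reading is right, which is
why a text consumer has to either be told the encoding or guess it.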

What's your reason for wanting to process text data separately? Is
there some missing feature in Tika that would help achieve your use
case without the need for custom processing of text data?

For example, the HTML parser supports the IdentityHtmlMapper feature
for skipping the HTML simplification that Tika does by default. To
activate that feature, you can pass an IdentityHtmlMapper instance in
the parse context:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());

--
Jukka Zitting
