Thank you very much, Jukka.

I will use your code.


I will also rethink my whole logic about parsing text.



Avi.


On Thu, Aug 14, 2014 at 4:04 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Sun, Aug 10, 2014 at 1:46 AM, Avi Hayun <[email protected]> wrote:
> > How do I identify content types which can't be read as text (in notepad
> > for example) because they have some binary content in them?
>
> You can use the media type relationship information stored in
> Tika's type registry, like this:
>
>     Tika tika = new Tika();
>     MediaType type = MediaType.parse(tika.detect(...));
>
>     MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
>     // isInstanceOf(a, b): true if a equals b or is a subtype of b
>     if (registry.isInstanceOf(type, MediaType.TEXT_PLAIN)) {
>         // process text
>     } else {
>         // process binary
>     }
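
To make the idea of a type hierarchy concrete, here is a minimal plain-Java sketch of a lookup that walks parent links. It is only an illustration and not Tika's implementation: the class ToyTypeRegistry, its methods, and the sample parent links are all hypothetical, and the real MediaTypeRegistry works from Tika's own type database.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyTypeRegistry {
    // child type -> declared parent type (hypothetical sample data, assumed acyclic)
    private final Map<String, String> parents = new HashMap<>();

    void addSuperType(String type, String parent) {
        parents.put(type, parent);
    }

    /** True if type equals ancestor or reaches it by following parent links. */
    boolean isInstanceOf(String type, String ancestor) {
        for (String t = type; t != null; t = parents.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyTypeRegistry registry = new ToyTypeRegistry();
        // Illustrative links only; they are not read from Tika.
        registry.addSuperType("text/html", "text/plain");
        registry.addSuperType("application/xhtml+xml", "text/html");
        registry.addSuperType("image/png", "application/octet-stream");

        for (String type : new String[] {"text/plain", "application/xhtml+xml", "image/png"}) {
            String kind = registry.isInstanceOf(type, "text/plain") ? "text" : "binary";
            System.out.println(type + " -> " + kind);
        }
    }
}
```

The point of the sketch is that a type counts as text when it is text/plain itself or when any of its ancestors is, which is why the order of the two arguments matters.
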
>
>
> > [...] if it finds text-parsable content, I want it to take the content
> > as it is
>
> Note that consuming text data can be surprisingly difficult given all
> the different character encodings out there. Tika's parser classes
> contain quite a bit of logic for automatically figuring out the
> correct character encoding and other details needed for correctly
> consuming text data.
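
A small plain-Java sketch of why the encoding matters, using only the JDK and not Tika's detection logic. The class StrictDecode and the helper decodeOrNull are hypothetical names chosen for this illustration. It decodes the same word strictly under two charsets: a wrong guess either fails outright or silently produces garbage.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    /** Decodes data with the given charset, or returns null if the bytes are invalid in it. */
    static String decodeOrNull(byte[] data, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // the bytes are not valid in this charset
        }
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1); // accent is the single byte 0xE9
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);        // accent is the two bytes 0xC3 0xA9

        // ISO-8859-1 bytes are rejected by a strict UTF-8 decoder.
        System.out.println(decodeOrNull(latin1, StandardCharsets.UTF_8));
        // UTF-8 bytes decode "successfully" as ISO-8859-1, but to the wrong characters.
        System.out.println(word.equals(decodeOrNull(utf8, StandardCharsets.ISO_8859_1)));
    }
}
```

The second case is the awkward one: no error is raised, so a reader that just assumes one encoding cannot tell that the text came out wrong.
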
>
> What's your reason for wanting to process text data separately? Is
> there some missing feature in Tika that would help achieve your use
> case without the need for custom processing of text data?
>
> For example the HTML parser supports the IdentityHtmlMapper feature
> for skipping the HTML simplification that Tika does by default. To
> activate that feature, you can pass an IdentityHtmlMapper instance in
> the parse context:
>
>     ParseContext context = new ParseContext();
>     context.set(HtmlMapper.class, new IdentityHtmlMapper());
>
> --
> Jukka Zitting
>
