Thank you Jukka very much. I will use your code.
I will also rethink my whole logic about parsing text.

Avi.

On Thu, Aug 14, 2014 at 4:04 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Sun, Aug 10, 2014 at 1:46 AM, Avi Hayun <[email protected]> wrote:
> > How do I identify content types which can't be read as text (in notepad
> > for example) because they have some binary content in them?
>
> You can use the media type relationship information stored in
> Tika's type registry, like this:
>
>     Tika tika = new Tika();
>     MediaType type = MediaType.parse(tika.detect(...));
>
>     MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
>     // isInstanceOf(a, b) is true when a equals b or is a specialization of b
>     if (registry.isInstanceOf(type, MediaType.TEXT_PLAIN)) {
>         // process text
>     } else {
>         // process binary
>     }
>
> > [...] if it finds text-parsable content, I want it to take the content
> > as it is
>
> Note that consuming text data can be surprisingly difficult given all
> the different character encodings out there. Tika's parser classes
> contain quite a bit of logic for automatically figuring out the
> correct character encoding and other details needed for correctly
> consuming text data.
>
> What's your reason for wanting to process text data separately? Is
> there some missing feature in Tika that would help achieve your use
> case without the need for custom processing of text data?
>
> For example the HTML parser supports the IdentityHtmlMapper feature
> for skipping the HTML simplification that Tika does by default. To
> activate that feature, you can pass an IdentityHtmlMapper instance in
> the parse context:
>
>     ParseContext context = new ParseContext();
>     context.set(HtmlMapper.class, new IdentityHtmlMapper());
>
> --
> Jukka Zitting
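
To illustrate the character-encoding point in Jukka's reply, here is a minimal sketch using only the Java standard library. It is not how Tika detects encodings; the class name `EncodingPitfall` and the helper `decodeStrict` are made up for this example. It only shows that the same bytes can be valid text in one charset and malformed in another, which is why taking text content "as it is" still requires knowing the encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Tika.
class EncodingPitfall {

    // Decode bytes strictly with the given charset.
    // Returns null instead of silently substituting replacement characters
    // when the bytes are not valid in that charset.
    static String decodeStrict(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, encoded as ISO-8859-1: the last byte is 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Valid when read back with the charset it was written in.
        System.out.println(decodeStrict(latin1, StandardCharsets.ISO_8859_1));

        // Malformed as UTF-8: 0xE9 starts a multi-byte sequence that never completes.
        System.out.println(decodeStrict(latin1, StandardCharsets.UTF_8));
    }
}
```

A strict decoder like this can reject a wrong guess, but it cannot tell you which charset is the right one; that guessing step is the part Tika's parsers handle for you.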
