Not exactly. My question is:
How do I identify content types that can't be read as text (in Notepad, for example) because they contain binary content?

For example:
application/atom+xml - is text for me, as it doesn't contain any binary content
application/json - ditto

But:
application/pdf - contains binary content
audio/ogg - contains binary content

My web crawler crawls the web. If it finds text-parsable content, I want it to take the content as it is, but if the content contains binary data I want to take Tika's parsing of it instead.


On Fri, Aug 8, 2014 at 8:46 PM, Ken Krugler <[email protected]> wrote:

> Hi Avi,
>
> Just to clarify, are you asking for some way to determine whether a given
> file (format) will never return any text (other than metadata)?
>
> Thanks,
>
> -- Ken
>
> On Aug 7, 2014, at 11:28pm, Avi Hayun <[email protected]> wrote:
>
> Hi,
>
> I am crawling my site and am using Tika for binary content parsing.
>
> But, how can I know if a certain url contains binary content or plain text
> ?
>
> I can get the contentType.
>
>
> So for now I am using:
> if (typeStr.contains("image") || typeStr.contains("audio") ||
> typeStr.contains("video") || typeStr.contains("application")) {
>     return true;
> }
>
>
> Which is dumb code.
>
> I will replace the plain strings with Tika's MediaType objects but still I
> need better code
>
> Does anyone have any better idea ?
>
>
> Thank you for your help,
> Avi
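
To make the distinction above concrete (text-like types versus types with binary content), here is a rough sketch of the kind of check I am after. It is only a heuristic on the Content-Type string and uses nothing from Tika: the class name TextualContentTypes, the method isTextual and the allow-list are all my own assumptions for illustration, and the list would certainly need extending.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Tika class): guesses from the Content-Type header
// alone whether the body can be kept as text or should be handed to Tika.
final class TextualContentTypes {

    // Assumed allow-list of types outside "text/*" that are still plain text.
    private static final Set<String> TEXT_LIKE = new HashSet<>(Arrays.asList(
            "application/json",
            "application/xml",
            "application/javascript",
            "application/x-www-form-urlencoded"));

    private TextualContentTypes() {
    }

    static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false; // unknown type: treat as binary and let Tika decide
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (base.isEmpty()) {
            return false;
        }
        return base.startsWith("text/")   // text/html, text/plain, ...
                || base.endsWith("+xml")  // application/atom+xml, image/svg+xml, ...
                || base.endsWith("+json") // application/ld+json, ...
                || TEXT_LIKE.contains(base);
    }
}
```

With this, the crawler would keep the raw content when TextualContentTypes.isTextual(typeStr) is true and fall back to Tika otherwise. It inverts my current check: instead of listing what is binary, it lists what is known to be text and treats everything else as binary, which seems safer for types I have not seen yet.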
