Not exactly.

My question is:

How do I identify content types which can't be read as text (in Notepad, for
example) because they contain binary content?


For example:
application/atom+xml - is text for me as it doesn't contain any binary
content
application/json - ditto


But
application/pdf - contains binary content
audio/ogg - contains binary content



My web crawler crawls the web. If it finds text-parsable content, I want it
to take the content as it is, but if the content is binary, I want to use
Tika's parsed output instead.
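
To make the distinction concrete, here is a minimal sketch of the kind of
check I have in mind, using only the content type string. It does not use
Tika at all. The class name TextTypeCheck, the method isTextLike, and the
small allow-list are all hypothetical, and the rules (text/*, a +xml or
+json suffix, plus a few known types) are an assumption about what counts
as "readable as text", not a complete answer.

```java
import java.util.Locale;
import java.util.Set;

final class TextTypeCheck {
    // Hypothetical allow-list: types outside text/* that are still plain text.
    private static final Set<String> TEXT_LIKE = Set.of(
            "application/json",
            "application/xml",
            "application/javascript");

    // Returns true when the content type looks like text that can be used as is.
    static boolean isTextLike(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (type.isEmpty()) {
            return false;
        }
        if (type.startsWith("text/")) {
            return true;
        }
        // Structured syntax suffixes, e.g. application/atom+xml.
        if (type.endsWith("+xml") || type.endsWith("+json")) {
            return true;
        }
        return TEXT_LIKE.contains(type);
    }

    public static void main(String[] args) {
        String[] samples = {
            "application/atom+xml", "application/json",
            "application/pdf", "audio/ogg"
        };
        for (String s : samples) {
            System.out.println(s + " -> text-like: " + isTextLike(s));
        }
    }
}
```

Anything for which isTextLike returns false would go through Tika; anything
else would be stored as fetched. The obvious weakness is that the allow-list
has to be maintained by hand, which is why I am asking for a better idea.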


On Fri, Aug 8, 2014 at 8:46 PM, Ken Krugler <[email protected]>
wrote:

> Hi Avi,
>
> Just to clarify, are you asking for some way to determine whether a given
> file (format) will never return any text (other than metadata)?
>
> Thanks,
>
> -- Ken
>
> On Aug 7, 2014, at 11:28pm, Avi Hayun <[email protected]> wrote:
>
> Hi,
>
> I am crawling my site and am using Tika for binary content parsing.
>
> But how can I know whether a certain URL contains binary content or plain
> text?
>
> I can get the contentType.
>
>
> So for now I am using:
> if (typeStr.contains("image") || typeStr.contains("audio") ||
> typeStr.contains("video") || typeStr.contains("application")) {
>  return true;
> }
>
>
> Which is dumb code.
>
> I will replace the plain strings with Tika's MediaType objects, but I
> still need better code.
>
> Does anyone have a better idea?
>
>
>
>
> Thank you for your help,
> Avi
>
>
>
>
