Hi Eliott,

I think the answer to your question is that Tika does not perform OCR on any
format.

Some PDF files contain text and layout information instead of images. In
this case, a PDF text extractor can calculate how the text will be rendered
on a page and from that information figure out what text goes together and
extract it.

In other words, while PDF text extractors work much harder than text
extractors for simpler formats, they are still starting with text embedded
in the format instead of using OCR to identify characters in an image. Tika
does not extract text from PDFs if the PDF only contains images.

I know even less about the TIFF format than I do about the PDF format, but I
think it contains only image data, so the only way to get body text from a
TIFF is through OCR. Since Tika doesn't perform OCR, I don't think you can
get body text from a TIFF using Tika.
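
If you want to check whether your files carry anything besides image data,
one option is to list the tags in the TIFF's first image file directory
(IFD). Below is a minimal Python sketch of that idea. It is not Tika code:
the function name list_tiff_tags is made up for illustration, and it only
handles classic (non-BigTIFF) files and only the first IFD. Tag numbers of
32768 and above are private tags, so an application-specific text field
would presumably show up in that range.

```python
import struct


def list_tiff_tags(data: bytes):
    """Return (tag_id, field_type, count) for each entry in the first IFD.

    Hypothetical helper for illustration only; not part of Tika.
    """
    if len(data) < 8:
        raise ValueError("not a TIFF: file too short")

    # The first two bytes give the byte order: "II" little, "MM" big endian.
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF: bad byte-order mark")

    # Next come the magic number 42 and the offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF: bad magic number")

    # An IFD is a 2-byte entry count followed by 12-byte entries.
    (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = []
    for i in range(n_entries):
        entry_offset = ifd_offset + 2 + 12 * i
        tag, field_type, count, _value = struct.unpack_from(
            endian + "HHI4s", data, entry_offset
        )
        tags.append((tag, field_type, count))
    return tags
```

Calling list_tiff_tags(open(path, "rb").read()) on one of your files and
looking for entries with a tag number of 32768 or higher would at least tell
you whether there is a private tag worth investigating.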

I hope this helps.

Paul

On Fri, Mar 11, 2011 at 10:16 AM, Eliott <[email protected]> wrote:

> Hi!
>
> Can anybody point me into the right direction? this text in tiff seems to
> be a special tag used by Microsoft and some other applications.
>
> regards
> eliott
>
>
>
> On 10/03/2011 16:18, Eliott wrote:
>
>> Dear  Users!
>>
>> We are using tika indirectly for a project based on jackrabbit. during the
>> final phase of this project came into my attention that tiff files are also
>> capable of storing the image and the ocr-ed text in a same file, just like
>> PDFs do. Since we have many of such files, we have a business need to
>> extract text from these tiffs to be able to do full text searches. As I
>> understand tikka does not support this functionality in case of tiffs, while
>> pdfs do work ok.  Is there any special reason for this?
>>
>> Has anybody written a text extractor or knows a library that can get the
>> text layer from these files?
>>
>> thanks in advance
>> eliott
>>
>>
>
