Hi Eliott,

I think the answer to your question is that Tika does not perform OCR on any format.
Some PDF files contain text and layout information instead of images. In this case, a PDF text extractor can calculate how the text will be rendered on a page and, from that information, work out which text belongs together and extract it. In other words, while PDF text extractors work much harder than text extractors for simpler formats, they still start from text embedded in the format rather than using OCR to identify characters in an image. Tika does not extract text from a PDF that contains only images.

I know even less about the TIFF format than I do about the PDF format, but I think it strictly contains image data, and the only way to get body text from a TIFF is through OCR. Since Tika doesn't perform OCR, I don't think you can get body text from a TIFF using Tika.

I hope this helps.

Paul

On Fri, Mar 11, 2011 at 10:16 AM, Eliott <[email protected]> wrote:

> Hi!
>
> Can anybody point me in the right direction? This text in TIFF seems to
> be a special tag used by Microsoft and some other applications.
>
> regards
> eliott
>
> On 10/03/2011 16:18, Eliott wrote:
>
>> Dear Users!
>>
>> We are using Tika indirectly for a project based on Jackrabbit. During the
>> final phase of this project it came to my attention that TIFF files are also
>> capable of storing the image and the OCR-ed text in the same file, just like
>> PDFs do. Since we have many such files, we have a business need to
>> extract text from these TIFFs to be able to do full-text searches. As I
>> understand it, Tika does not support this functionality for TIFFs, while
>> PDFs work OK. Is there any special reason for this?
>>
>> Has anybody written a text extractor, or does anybody know a library that
>> can get the text layer from these files?
>>
>> thanks in advance
>> eliott
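
Regarding the special TIFF tag mentioned in the thread: a TIFF file is organized as a directory of numbered tags, so one rough way to see whether a file carries anything besides the usual image fields is to list its tag numbers. The sketch below is not Tika code and does not come from this thread. It is a minimal Python reader using only the standard library, and the function names (`list_tiff_tags`, `private_tag_ids`) are made up for illustration. It assumes a classic TIFF (not BigTIFF), only reports tag numbers, and does not decode any tag's contents or know which tag a particular application uses for text.

```python
import struct


def list_tiff_tags(data: bytes):
    """Return (tag_id, field_type, count) for every directory entry in a classic TIFF."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF")

    # Bytes 0-1 give the byte order: "II" is little-endian, "MM" is big-endian.
    order = data[:2]
    if order == b"II":
        endian = "<"
    elif order == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF: bad byte-order mark")

    # Bytes 2-3 hold the magic number 42, bytes 4-7 the offset of the first directory (IFD).
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")

    tags = []
    seen = set()  # guards against a malformed file whose directories form a loop
    while ifd_offset != 0 and ifd_offset not in seen:
        seen.add(ifd_offset)
        (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
        for i in range(n_entries):
            # Each entry is 12 bytes: tag (2), type (2), count (4), value or offset (4).
            entry_pos = ifd_offset + 2 + 12 * i
            tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry_pos)
            tags.append((tag, field_type, count))
        # The 4 bytes after the last entry point to the next directory, or are 0 at the end.
        (ifd_offset,) = struct.unpack_from(endian + "I", data, ifd_offset + 2 + 12 * n_entries)
    return tags


def private_tag_ids(data: bytes):
    """Return tag numbers of 32768 and above, the range the TIFF spec sets aside for private tags."""
    return [tag for tag, _type, _count in list_tiff_tags(data) if tag >= 32768]
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example `private_tag_ids(open("scan.tif", "rb").read())`, where `scan.tif` is a placeholder name. Any number it reports would then have to be looked up to find out which application wrote it and what it contains.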
