Hi Samir, and thanks for your response. I've already tried that and it makes no difference, at least with the default settings. I've attached a small PDF that shows what I mean: how do I extract only the "visible text"?
If you try pdftotext (I'm using Ubuntu 12.10), it skips the invisible text.

Thanks

>________________________________
> From: samir pendharkar <[email protected]>
>To: [email protected]; Brad Stallion <[email protected]>
>Sent: Thursday, 21 February 2013 13:21
>Subject: Re: Tika and invisible text from pdf
>
>
>In such cases what works best is to look at the "Structured Text" view in the
>Tika GUI.
>
>You might be able to skip tags that you don't want in the output (assuming
>the invisible part is in some different tag).
>
>
>
>On Thu, Feb 21, 2013 at 4:58 PM, Brad Stallion <[email protected]> wrote:
>
>>Hi all,
>>
>>I'm extracting text from PDF files using my own SAX handler. The problem is
>>that I get both visible and invisible text, i.e. text contained in invisible
>>parts of the layout.
>>How can I identify the invisible parts?
>>
>>I've asked on Stack Overflow as well:
>>
>>http://stackoverflow.com/questions/14956556/tika-and-invisible-text-from-pdf
>>
>>Thanks a lot for your help!
>>
>>bye
>>
>
visible.pdf
Description: Adobe PDF document
