Re: Tika and invisible text from pdf

Brad Stallion Thu, 21 Feb 2013 05:11:04 -0800

Hi Samir and thanks for your response.
I've already tried that, and it makes no difference, at least with the default settings.
I'm attaching a small PDF that shows what I mean: how do I extract only the "visible text"?


If you try pdftotext (I'm using Ubuntu 12.10), it skips the invisible text.
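
For reference, the tag-skipping idea from the quoted reply below can be sketched as a SAX handler that drops text inside marked elements. This is only a minimal sketch using the JDK's SAX classes: the class name SkippingTextHandler, the extract helper, and the "hidden" class value are hypothetical, and it assumes the unwanted text sits in an element that carries a distinguishing class attribute. Tika's PDF output may not mark invisible text that way, which would be consistent with it making no difference here.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data, except inside elements
// whose "class" attribute is in the skip set (and inside their descendants).
class SkippingTextHandler extends DefaultHandler {
    private final Set<String> skippedClasses;
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a skipped element

    SkippingTextHandler(Set<String> skippedClasses) {
        this.skippedClasses = skippedClasses;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String cls = atts.getValue("class");
        // Once skipping has started, every nested element deepens the skip.
        if (skipDepth > 0 || (cls != null && skippedClasses.contains(cls))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString();
    }

    // Convenience helper for trying the handler on an XHTML string.
    static String extract(String xhtml, Set<String> skippedClasses) {
        try {
            SkippingTextHandler handler = new SkippingTextHandler(skippedClasses);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

With Tika, a handler like this could be passed as the ContentHandler argument of the parser's parse(stream, handler, metadata, context) method in place of the XHTML string used here.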

Thanks



>________________________________
> From: samir pendharkar <[email protected]>
>To: [email protected]; Brad Stallion <[email protected]> 
>Sent: Thursday, 21 February 2013 13:21
>Subject: Re: Tika and invisible text from pdf
> 
>
>In such cases, what works best is to look at the "Structured Text" view in the 
>Tika GUI.
>
>You might be able to skip the tags that you don't want in the output (assuming 
>the invisible part is in a different tag). 
>
>
>
>On Thu, Feb 21, 2013 at 4:58 PM, Brad Stallion <[email protected]> wrote:
>
>Hi all,
>>
>>I'm extracting text from PDF files using my own SAX handler. The problem is 
>>that I get both visible and invisible text, i.e. text contained in invisible 
>>parts of the layout.
>>How can I identify the invisible parts?
>>
>>I've also asked on Stack Overflow:
>>
>>http://stackoverflow.com/questions/14956556/tika-and-invisible-text-from-pdf
>>
>>Thanks a lot for your help!
>>
>>bye
>>
>
>
>

visible.pdf
Description: Adobe PDF document
