I've found something interesting here:

http://www.pdflib.com/fileadmin/pdflib/pdf/manuals/TET-4.1-manual.pdf


"Area of text extraction. By default, TET will extract all text from the
visible page area. Using the clippingarea option of open_page() (see
Table 10.10, page 172) you can change this to any of the PDF page box
entries (e.g. TrimBox). With the keyword unlimited all text regardless of
any page boxes can be extracted. The default value cropbox instructs TET
to extract text within the area which is visible in Acrobat."
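
To make clear what I mean, here is a rough Java sketch of the clipping idea as I understand it from that paragraph: keep a piece of text only if its position falls inside the chosen page box. TextChunk and PageBoxClipper are names I made up for illustration. This is not TET's or Tika's API, and it assumes the extractor can report a position for each piece of text.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a positioned piece of text on a page.
class TextChunk {
    final String text;
    final double x; // horizontal position of the chunk's origin, in points
    final double y; // vertical position of the chunk's origin, in points

    TextChunk(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical helper: filters chunks against a page box such as the CropBox.
class PageBoxClipper {
    // Returns the text of every chunk whose origin lies inside the box.
    static List<String> clip(List<TextChunk> chunks, Rectangle2D box) {
        List<String> kept = new ArrayList<>();
        for (TextChunk chunk : chunks) {
            if (box.contains(chunk.x, chunk.y)) {
                kept.add(chunk.text);
            }
        }
        return kept;
    }
}
```

For example, with a 612 x 792 point box starting at the origin, a chunk at (72, 700) would be kept and a chunk at (700, 100) would be dropped. This only covers text that is hidden because it lies outside the box.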

How can I get the same behavior using Tika?
Thanks a lot




>________________________________
> From: Brad Stallion <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Thursday, 21 February 2013 14:10
>Subject: Re: Tika and invisible text from pdf
> 
>
>Hi Samir, and thanks for your response.
>I've already tried that and it makes no difference, at least with the default settings.
>I've attached a small PDF that shows what I mean: how do I extract only the
>"visible" text?
>
>
>If you try pdftotext (I'm using Ubuntu 12.10), it skips the invisible text.
>
>
>Thanks
>
>
>
>>________________________________
>> From: samir pendharkar <[email protected]>
>>To: [email protected]; Brad Stallion <[email protected]>
>>Sent: Thursday, 21 February 2013 13:21
>>Subject: Re: Tika and invisible text from pdf
>> 
>>
>>In such cases, what works best is to look at the "Structured Text" view in
>>the Tika GUI.
>>
>>You might be able to skip the tags that you don't want in the output
>>(assuming the invisible part is in a different tag).
>>
>>
>>
>>On Thu, Feb 21, 2013 at 4:58 PM, Brad Stallion <[email protected]> wrote:
>>
>>>Hi all,
>>>
>>>I'm extracting text from PDF files using my own SAX handler. The problem is 
>>>that I get both visible and invisible text, i.e. text contained in invisible 
>>>parts of the layout.
>>>How can I identify the invisible parts?
>>>
>>>I've asked on Stack Overflow as well:
>>>
>>>http://stackoverflow.com/questions/14956556/tika-and-invisible-text-from-pdf
>>>
>>>Thanks a lot for your help!
>>>
>>>bye
>>>
>>
>>
>>
>
>
