Hi Dave, no, not yet, but that's a good idea. If PDFBox has a parameter I can tune for this, how can I access it directly? Thanks
>________________________________
> From: Dave Meikle <[email protected]>
>To: [email protected]; Brad Stallion <[email protected]>
>Sent: Sunday, 10 March 2013 0:53
>Subject: Re: Tika and invisible text from pdf
>
>Hi Brad,
>
>On 21 Feb 2013, at 11:28, Brad Stallion <[email protected]> wrote:
>
>> I'm extracting text from PDF files using my own sax handler. The problem is
>> that I get both visible and invisible text, i.e. text contained in invisible
>> parts of the layout.
>> How can I identify the invisible parts?
>
>We use PDFBox under the hood in Tika. Have you tried asking on their user
>list?
>
>Cheers,
>Dave
>
>
