I defer to my colleagues on PDFBox, but the Unicode mapping warning means what it says: there is no way (short of NLP/language modeling/AI) to reconstruct how to map the characters as stored in the document to the correct Unicode equivalents. The electronic text stored within the PDF may or may not reflect the presentation layer, and with no Unicode mapping in the attached file, it doesn't.

If you "save as text" the attached file with Adobe Reader, you also get garbage:

FGHIJKLIMHNNOPQMRSMNQTLIPMHMQPQMNLUVWHJQMXYZ[\Y]Y^[_Y'aYbacdYedY'fa__ghiYedYjkljmYnfiha\hY'dYo\p\ch_dYedcYq\papndcYr[\Yiha\hYnsotihdphYts[_ _dodhh_dY'dcYaodpedcYdhYann_

Again, short of AI, your best bet is to run OCR (Tesseract) on these files. Somewhere on my plate is integrating tika-eval _into_ the PDFParser, so that it can detect when mojibake is being extracted and run OCR instead (TIKA-2749?), but that's likely several months off.

Sorry I can't help...

On Mon, Apr 1, 2019 at 5:26 PM Giovanni De Stefano <[email protected]> wrote:
> > Hello,
> >
> > I am having trouble extracting data from a bunch of pdf.
> >
> > The output I get is something like:
> >
> > cd\pYe[Ŷd_z\ndYedYnspn\̀\ah\spYv\cnàdY
> ỲaY€d̀̀[̀dYedcYcapnh\spcYaeo\p\ch_ah\zdcY ̂€‚ƒmYr[\YaY_dt_\cYndcYnsotihdpndcw
> > „S……KMV†SNMWL…MHULIRL…M‡ˆ„MXY‰spY]YŠdY€̂‚YpfdchYnsotihdphYr[fdpYoah\‹_dYedY_do\cdYdhŒs[Y_ie[nh\spYedcYann_s\ccdodphcYef\otuhYdhYaodpedc
> > v\cnàdcYaeo\p\ch_ah\zdcYdpYoah\‹_dYef\otuhYc[_ỲdcY_dzdp[cmYedYha
> dcYacc\o\̀idcYa[ Y\otuhcYc[_ỲdcY_dzdp[cYdhYedYe_s\hcYdhYha dcYe\zd_cdcwYŠdc
> > _do\cdcYdhŒs[Y_ie[nh\spcYefaodpedcYŽ
> ‚YpdYcsphYespnYtacYz\cidcwYŠdYo\p\ch_dYedcYq\papndcYs[YcspYvspnh\sppa\_dYeìi
> [iY_dchdphYespnYnsotihdphc
> > a_hwY}Ya__ghiYe[Y_i dphmYjkw|lwjkljƒYw
> >
> > The logs inform me that many Unicode mappings are missing:
> >
> > WARN No Unicode mapping for 87 (87) in font null
> > WARN No Unicode mapping for 88 (88) in font null
> > WARN No Unicode mapping for .notdef (89) in font null
> > WARN No Unicode mapping for 90 (90) in font null
> > WARN No Unicode mapping for 91 (91) in font null
> > WARN No Unicode mapping for 92 (92) in font null
> >
> > I can reproduce this behavior with a vanilla Tika Server 1.20.
> >
> > I attach the pdf here.
> >
> > What could be wrong? Any idea on the steps I can take to properly extract metadata and body?
> >
> > Thanks a lot,
> > Giovanni
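
P.S. For what it's worth, here is a very crude sketch of the kind of check mentioned above: deciding whether extracted text looks like mojibake, so that a caller could fall back to OCR. This is not what tika-eval does and not anything in Tika or PDFBox; the function name `looks_garbled` and both thresholds are made up for illustration and would need tuning on real documents. It only relies on two observations from the output in this thread: the garbled text has almost no whitespace (so "words" are very long), and it can contain an unusually high share of non-letter characters.

```python
def looks_garbled(text, max_avg_token_len=15.0, min_alpha_ratio=0.7):
    """Heuristic guess at whether extracted text is mojibake.

    Hypothetical helper for illustration only; the thresholds are
    assumptions, not values taken from tika-eval or any other tool.
    """
    tokens = text.split()
    if not tokens:
        # Nothing to judge; an empty extraction is a separate problem.
        return False

    # Natural-language text is broken up by whitespace, so its
    # whitespace-delimited tokens are short on average.
    avg_token_len = sum(len(t) for t in tokens) / len(tokens)

    # Share of letters among the non-whitespace characters.
    non_space = [c for c in text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)

    return avg_token_len > max_avg_token_len or alpha_ratio < min_alpha_ratio


if __name__ == "__main__":
    # Start of the "save as text" output quoted in this thread.
    sample = "FGHIJKLIMHNNOPQMRSMNQTLIPMHMQPQMNLUVWHJQMXYZ"
    print(looks_garbled(sample))  # True: one very long "word"
    print(looks_garbled("I am having trouble extracting data from a bunch of pdf."))  # False
```

A check like this will miss garbage that happens to contain spaces and mostly letters, which is why a language-aware comparison against common words is the more robust route; when it does fire, the file is a candidate for OCR.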
