On Tue, Mar 24, 2015 at 9:26 AM, Maruan Sahyoun <[email protected]> wrote:
>... As you would like to remove certain vectors which are matching a certain
>character/glyph you first need to find out which are the ones drawing e.g. the letter
>'T'. I don't think that this is doable in a reasonable amount of time for arbitrary text.
>Maruan

This is true! And it's unfortunately a common problem with PDFs which use
* outline fonts/glyphs
* pixel glyphs
* scanned text

I think it is possible in limited subdomains and we are starting to try to do this in science/maths. Our approach (https://bitbucket.org/petermr/diagramanalyzer, https://bitbucket.org/petermr/imageanalysis, https://bitbucket.org/petermr/javaocr) is to create tools that recognize text in common fonts. Unfortunately there is no clear choice of OCR library for Java (we looked at all of them - Tesseract is not native Java - and have ended up extending javaocr). Scanned typescript can be a nightmare (missing pixels, bleeding across glyph boundaries, etc.) but sometimes works.

In our approach we try to analyze born-digital glyphs by heuristics rather than machine learning (which needs retraining for every new font/size). The vector glyphs have a constant SVG signature for each character, and this can sometimes be worked out (or mapped by the crowd). The pixel glyphs are harder: we shrink them to a common skeleton and classify from that. Once one character is done it's usually possible to recognize it in later occurrences.

It's early days, but if people are interested in collaborating or have better solutions we'd be interested (we aren't able to help with casual problems).

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
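
A rough illustration of the "constant SVG signature" idea mentioned above. This is a minimal Python sketch and not the code in the linked repositories: it assumes the signature is simply the sequence of SVG path command letters with all coordinates dropped, and the names path_signature, learn and recognise are hypothetical.

```python
import re

# Assumption: a glyph's "signature" is the ordered list of SVG path
# commands (M, L, C, Z, ...) with the coordinates thrown away.
_COMMANDS = re.compile(r"[MmLlHhVvCcSsQqTtAaZz]")

def path_signature(d: str) -> str:
    """Reduce an SVG path 'd' attribute to its command letters.

    Relative and absolute commands are folded together by upper-casing,
    so the result does not depend on position or scale.
    """
    return "".join(_COMMANDS.findall(d)).upper()

# Hypothetical lookup table: signature -> character, filled in by hand
# (or by the crowd) the first time a glyph is identified.
known = {}

def learn(d: str, char: str) -> None:
    """Record that the outline 'd' draws the character 'char'."""
    known[path_signature(d)] = char

def recognise(d: str):
    """Return the character for a previously seen signature, else None."""
    return known.get(path_signature(d))

# A sans-serif 'T' drawn as eight straight segments.
learn("M0 0 L10 0 L10 2 L6 2 L6 12 L4 12 L4 2 L0 2 Z", "T")

# The same outline, translated and scaled, has the same signature.
later = "M100 50 L120 50 L120 54 L112 54 L112 74 L108 74 L108 54 L100 54 Z"
print(recognise(later))
```

Different characters can share such a signature (any plain rectangle gives the same string, for instance), so a lookup like this would need further features such as aspect ratio before it could be relied on.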

