At Contentmine.org we are developing software downstream of PDFBox that tackles part of this problem. We are extracting complete data from graphical plots; see https://arxiv.org/html/1709.02261. Where the PDF contains vector representations of paths and explicit characters (rather than pixels), we turn these into SVG and then assemble the components. In favourable cases the complete data from a plot can be recovered.

There are several sorts of rotated characters in your 60-page PDF:

* completely rotated slides (quarter turn)
* x- and y-axes
* 45-degree rotation for categorical variables in plots

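To illustrate how these cases can be told apart, here is a minimal sketch (not the svg2xml code). It assumes each glyph arrives with the a and b entries of its text matrix [a b c d e f], which in PDFBox would presumably come from TextPosition.getTextMatrix(). For a pure rotation a = cos(theta) and b = sin(theta), so the angle is atan2(b, a). The class RotationSketch and the Glyph holder are hypothetical names made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.Arrays;

class RotationSketch {

    /** Hypothetical holder: one character plus the a and b entries of its text matrix. */
    static class Glyph {
        final String text;
        final double a;
        final double b;

        Glyph(String text, double a, double b) {
            this.text = text;
            this.a = a;
            this.b = b;
        }
    }

    /** Rotation in whole degrees, counter-clockwise, normalised to the range 0..359. */
    static int rotationDegrees(double a, double b) {
        long deg = Math.round(Math.toDegrees(Math.atan2(b, a)));
        return (int) ((deg % 360 + 360) % 360);
    }

    /** Concatenates glyph text per rotation angle, preserving input order within each angle. */
    static Map<Integer, StringBuilder> groupByRotation(List<Glyph> glyphs) {
        Map<Integer, StringBuilder> groups = new TreeMap<>();
        for (Glyph g : glyphs) {
            int angle = rotationDegrees(g.a, g.b);
            groups.computeIfAbsent(angle, k -> new StringBuilder()).append(g.text);
        }
        return groups;
    }

    public static void main(String[] args) {
        double s = Math.sqrt(0.5); // cos(45 deg) == sin(45 deg)
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("R", 0, 1), new Glyph("i", 0, 1),   // quarter-turn text
                new Glyph("A", s, s), new Glyph("B", s, s),   // 45-degree category labels
                new Glyph("x", 1, 0));                        // unrotated text
        groupByRotation(glyphs).forEach((angle, text) ->
                System.out.println(angle + " degrees: " + text));
    }
}
```

Grouping by angle first means that quarter-turn slides, axis text and 45-degree category labels are each assembled separately, instead of being interleaved into two-letter fragments.
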
Our strategy is to extract all components into "CacheProcessors", which identify many features including the rotation of text in degrees. We then heuristically build higher-level objects such as phrases and lists. Our primary input is scholarly publications, but I would expect slides to be processable as well.

The code is at http://github.com/ContentMine/svg2xml, though it's under rapid development at the moment.

On Mon, Sep 25, 2017 at 6:43 PM, Allison, Timothy B. <[email protected]> wrote:

> Thank you, Tilman. I haven't looked yet, but to confirm, there's no page
> parameter that specifies that the text has been rotated?
>
> Back to language modeling... 😊 Thank you, again!
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:[email protected]]
> Sent: Monday, September 25, 2017 1:39 PM
> To: [email protected]
> Subject: Re: Extracting rotated text
>
> No good idea except call setRotate() on the page and then do text
> extraction.
>
> A possible strategy might be to do all rotations and see which one brings
> most known words.
>
> Tilman
>
>
> On 25.09.2017 at 19:31, Allison, Timothy B. wrote:
> > Colleagues,
> > Any recommendations for extracting rotated text such as:
> > https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES ?
> >
> > Adobe DC gets reasonable text with "save as text". PDFBox's ExtractText
> > (and Tika) get something like this:
> >
> > FS
> > IS
> > L
> > is
> > te
> > ria
> > Li
> > st
> > er
> > ia
> > R
> > is
> > k
> > R
> > is
> > k
> > As
> > se
> > ss
> > m
> > en
> >
> > Thank you!

--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

