Re: Extracting rotated text

Peter Murray-Rust Mon, 25 Sep 2017 11:02:40 -0700

At Contentmine.org we are developing software downstream of PDFBox that
tackles part of this problem. We are extracting complete data from
graphical plots; see https://arxiv.org/html/1709.02261 .
Where the PDF contains vector representations of paths and explicit
characters (rather than pixels), we turn these into SVG and then assemble
the components. In favourable cases the complete data from a plot can be
recovered.
There are several sorts of rotated characters in your 60-page PDF:
* completely rotated slides (quarter turn)
* x- and y-axes
* 45-degree rotation for categorical variables in plots


Our strategy is to extract all components into "CacheProcessors", which
identify many features, including the degrees of rotation of text. We then
heuristically build higher-level objects such as phrases and lists. Our
primary input is scholarly publications, but I would expect slides to be
processable too.
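
To make the rotation step concrete, here is a minimal sketch in Python. It
is not the svg2xml code. It assumes each character arrives with a 2D affine
matrix (a, b, c, d, e, f), as in an SVG "matrix(...)" transform or a PDF
text matrix. The function names and the 1-degree tolerance are hypothetical,
chosen only for illustration.

```python
import math
from collections import defaultdict


def rotation_degrees(a, b):
    """Baseline angle in degrees, in [0, 360), from the first column (a, b)
    of an affine matrix. The sign convention depends on whether the y axis
    points up (PDF) or down (SVG), so treat the direction as an assumption."""
    return math.degrees(math.atan2(b, a)) % 360.0


def group_by_rotation(glyphs, tolerance=1.0):
    """Bucket (char, matrix) pairs by angle snapped to a multiple of
    `tolerance` degrees, so near-identical rotations land together."""
    groups = defaultdict(list)
    for char, (a, b, c, d, e, f) in glyphs:
        angle = rotation_degrees(a, b)
        # Snap to the tolerance grid; 360 wraps back to 0.
        key = (round(angle / tolerance) * tolerance) % 360.0
        groups[key].append(char)
    return dict(groups)


# Hypothetical input: two upright characters and one quarter-turn character.
glyphs = [
    ("A", (1, 0, 0, 1, 0, 0)),
    ("B", (0, 1, -1, 0, 5, 5)),
    ("C", (1, 0, 0, 1, 10, 0)),
]
groups = group_by_rotation(glyphs)
```

Each bucket can then be assembled into words and phrases on its own, so that
quarter-turn slides, axis text and 45-degree category labels are not
interleaved with horizontal text.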

The code is at http://github.com/ContentMine/svg2xml though it's under
rapid development at the moment.




On Mon, Sep 25, 2017 at 6:43 PM, Allison, Timothy B. <[email protected]>
wrote:

> Thank you, Tilman.  I haven't looked yet, but to confirm, there's no page
> parameter that specifies that the text has been rotated?
>
> Back to language modeling... 😊  Thank you, again!
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:[email protected]]
> Sent: Monday, September 25, 2017 1:39 PM
> To: [email protected]
> Subject: Re: Extracting rotated text
>
> No good idea except call setRotate() on the page and then do text
> extraction.
>
> A possible strategy might be to do all rotations and see which one brings
> most known words.
>
> Tilman
>
>
> Am 25.09.2017 um 19:31 schrieb Allison, Timothy B.:
> > Colleagues,
> > Any recommendations for extracting rotated text such as:
> > https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES ?
> >
> > Adobe DC gets reasonable text with "save as text".  PDFBox's ExtractText
> > (and Tika) get something like this:
> >
> > FS
> > IS
> > L
> > is
> > te
> > ria
> > Li
> > st
> > er
> > ia
> > R
> > is
> > k
> > R
> > is
> > k
> > As
> > se
> > ss
> > m
> > en
> >
> > Thank you!
> >


-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
