On Fri, Oct 31, 2014 at 3:18 PM, Brzrk One <[email protected]> wrote:

> This is exhaustingly difficult to do accurately in the general case.
> Narrowing it down to some heuristics that work for your application is
> advisable.

Agreed. It's generally not worth automating it for 1 document, but it is
worth it for hundreds which have basically the same format.

> I recall some publication of the IEEE that used statistics on the pixel
> density per line (that is, raster)
> to make determinations of paragraph changes and table representations.
> But that is easily confounded by graphics and graphical representations of
> text.

This is one technique I have used in http://bitbucket.org/petermr/ (see
SVG2XML, downstream from PDFBOX). It is quite good but can fail for (say)
two-column text which is wrapped in 1-column text, or vertical bars of text
rotated by 90 degrees in the margin or ...

> On Fri, Oct 31, 2014 at 11:12 AM, Walter Kehl <[email protected]>
> wrote:
>
> > Hi Frank,
> >
> > I am also interested in this topic. If you have some source code to
> > share, could I also participate?
> > I was also thinking about using font changes as a heuristic to detect
> > paragraphs. Would you know the best way to do this?
> >
> > Thanks and best regards
> >
> > Walter
> >
> > -----Original Message-----
> > From: Frank van der Hulst [mailto:[email protected]]
> > Sent: Wednesday, 29 October 2014 20:27
> > To: [email protected]
> > Subject: Re: Extracting text into paragraphs
> >
> > Hi João,
> > I'm happy to share source code for some work I've done on extracting
> > tables from PDF documents. That may be a starting point for you in that
> > it looks for graphic boxes drawn around text to identify table headings.
> >
> > Frank
> >
> > On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen <[email protected]> wrote:
> >
> > > You may want to get in contact with Peter Murray-Rust
> > > (http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge.
> > > He seems to have been working on molecular informatics involving
> > > extraction of information from PDFs, and probably has faced many of
> > > your issues.
> > > -- Ken Bowen
> > >
> > > On Oct 29, 2014, at 10:13 AM, João Cardoso <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm a researcher at INESC-ID and I'm currently working on an
> > > > application that intends to parse ISO standards (stored in PDF
> > > > files) and store their text into a database. This implies building
> > > > some sort of tree with all the sections and subsections and so on...
> > > >
> > > > Well I'm aware that PDF files don't reflect text structure so I was
> > > > aiming for a different approach. Just being able to have the text
> > > > split into paragraphs would already be a massive help. An amazing
> > > > help would be to have a way to distinguish between text styles so as
> > > > to sort normal text from headings and all that.
> > > >
> > > > Well I've managed to extract plain text with your API. And with a
> > > > lot of effort it would be possible to organize that plain text and
> > > > provide it with some structure.
> > > >
> > > > However, I was wondering if your API does not provide an easier way
> > > > to do this. Maybe using some sort of object iteration within a page?
> > > >
> > > > Thanks for the help.
> > > >
> > > > Best regards,
> > > >
> > > > *João M. F. Cardoso*
> > > > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > > > m:(+351) 916190940 | e:[email protected] | a:
> > > > Skype: joao.m.f.cardoso

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
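
For readers who want to try the pixel-density heuristic discussed in this
thread, here is a minimal sketch in Python. It is not the SVG2XML code and not
the method from the IEEE publication mentioned above. It assumes the page has
already been rasterized and binarized into rows of 0/1 pixels, and all
function names and the 1.5 threshold factor are hypothetical choices for
illustration. It counts ink per raster row, groups inked rows into text
lines, and flags a new paragraph wherever the blank gap before a line is
clearly larger than the typical gap.

```python
from statistics import median


def row_density(bitmap):
    """Number of ink pixels (value 1) in each raster row."""
    return [sum(row) for row in bitmap]


def text_lines(density, min_ink=1):
    """Return inclusive (top, bottom) row spans that contain ink."""
    spans, start = [], None
    for y, d in enumerate(density):
        if d >= min_ink and start is None:
            start = y                      # first inked row of a text line
        elif d < min_ink and start is not None:
            spans.append((start, y - 1))   # blank row closes the text line
            start = None
    if start is not None:                  # page ends inside a text line
        spans.append((start, len(density) - 1))
    return spans


def paragraph_starts(spans, factor=1.5):
    """Indices of text lines preceded by a gap larger than factor * median gap.

    The first line always starts a paragraph. The factor is an assumed
    tuning value, not one taken from any publication.
    """
    if not spans:
        return []
    gaps = [b[0] - a[1] - 1 for a, b in zip(spans, spans[1:])]
    if not gaps:
        return [0]
    typical = median(gaps)
    return [0] + [i + 1 for i, g in enumerate(gaps) if g > factor * typical]


# Toy page: five text lines, with a larger gap before the fourth one.
ink, blank = [1] * 8, [0] * 8
page = [ink, ink, blank, ink, ink, blank, ink, ink,
        blank, blank, blank, blank, ink, ink, blank, ink, ink]
print(paragraph_starts(text_lines(row_density(page))))  # [0, 3]
```

A sketch like this shares the failure modes described in the thread: a figure
or a neighbouring column fills the blank rows and hides the gap, and rotated
margin text adds ink to every row.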

