Extracting text into paragraphs

João Cardoso Wed, 29 Oct 2014 09:18:09 -0700

Hi,

I'm a researcher at INESC-ID, currently working on an application that
parses ISO standards (stored as PDF files) and stores their text in a
database. This implies building some sort of tree of all the sections,
subsections and so on.


I'm aware that PDF files don't reflect text structure, so I was aiming
for a different approach. Just being able to split the text into
paragraphs would already be a massive help. Even better would be a way
to distinguish between text styles, so as to separate normal text from
headings and the like.

I've managed to extract plain text with your API, and with a lot of
effort it would be possible to organize that plain text and give it
some structure.
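
To illustrate the kind of post-processing meant here, below is a minimal
Python sketch. It does not use the API in question; it works only on text
that has already been extracted. It assumes that paragraphs are separated
by blank lines and that each heading sits in its own block and starts with
a dotted clause number (e.g. "4.2.1 Scope"). Real extracted text may not
satisfy either assumption, and a body paragraph that happens to start with
a number would be mistaken for a heading. The names Section,
split_paragraphs and build_tree are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumed heading shape: dotted clause number, whitespace, then a title.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Section:
    number: str                      # e.g. "4.2.1"; empty for the root
    title: str
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)

    @property
    def depth(self):
        # "4" -> 1, "4.2" -> 2, root ("") -> 0
        return self.number.count(".") + 1 if self.number else 0


def split_paragraphs(text):
    """Split on blank lines and re-join the wrapped lines of each block."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(line.strip() for line in block.splitlines())
            for block in blocks if block.strip()]


def build_tree(text):
    """Attach each paragraph to the most recent heading, nested by depth."""
    root = Section(number="", title="(document)")
    stack = [root]                   # path from the root to the current section
    for para in split_paragraphs(text):
        match = HEADING_RE.match(para)
        if match is None:
            stack[-1].paragraphs.append(para)
            continue
        node = Section(number=match.group(1), title=match.group(2))
        # Climb back up until the top of the stack is shallower than the new node.
        while stack[-1].depth >= node.depth:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

The resulting tree could then be walked recursively to write one database
row per section. Telling headings apart by font or style, rather than by
numbering, would still need information that plain text does not carry.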

However, I was wondering whether your API provides an easier way to do
this, perhaps some sort of object iteration within a page?

Thanks for the help.

Best regards,

  *João M. F. Cardoso*
MSc in Telecommunications and Informatics Engineering, INESC-ID
m: (+351) 916190940 | e: [email protected] | Skype: joao.m.f.cardoso
