[R] Complex? import of pdf files (criminal records) into R table

2009-10-15 Thread Biedermann, Jürgen

Hi there,

I need to decide whether several more or less complex PDF files can be
transformed into an R table format automatically, or whether it has to be
done manually. It would be impudent to expect a complete solution, but I
would be grateful for any advice on how such an R program could be
structured, and on whether it is possible at all.


Here is the problem:
Each pdf file belongs to a person. The pdf files actually represent the 
anonymous criminal record of a person. Each entry should lead to one row 
with the person number as key. The different lines should form the 
columns. The criminal record actually looks like this:



---
Header with irrelevant text for us   |  Date: xx.xx. (relevant for us)

Anonymous person number: xxx

Entries in the register

1. xx.xx.1902  -City-
   Be in force since: xx.xx.1902
   Date of offense:xx.xx.
   Elements of the offence: For example Rape
   Section in law: §176, §178 Abs. 1
   Sentenced to 5 years imprisonment
   Irrelevant text for us
   Accommodation in an forensic psychiatry
   Accommodation sentenced on probation
   Rest of sentence sentenced on probation until the xx.xx.

2. xx.xx.1910
   Be in force since: 
   .

---

The problem is that the entries do not always have the same structure. 
The first 6 lines are structurally the same in each entry of the 
criminal record (each entry has a line for the judgement date, the be 
in force date, the date of offence, the elements of the offence, the 
Sections in law, and the sentence).


But then, depending on the sentence, different lines appear which contain
information on whether the person was sentenced on probation, whether the
probation was withdrawn again, when the person was released, etc.
So I think these lines should be allocated to different columns
depending on key words. Defining the key words would not be a problem in
most cases. If a certain column is not relevant in an entry (the key
word does not appear), NA should be put in its place.
But because the entries sometimes (in rare cases) contain spelling
errors, all lines of an entry that could not be allocated to a column
should finally be put in a separate column so they can be checked manually.


In the end the table should look more or less like this.

--
Per.Numb;EntryNumber;Judg.Date;DateOffen.;...;Probation.until;Released;Not allocated

1   1   xx.xx.1902  xx.xx.1901 ... xx.xx.1905  NA    blablabla
1   2   xx.xx.1910  xx.xx.1909 ... NA          1925  blablabla
2   1   xx.xx.1924  xx.xx.1923 ... NA          NA    blablabla
--

Could anyone help me?
Thanks

Greetings
Jürgen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Complex? import of pdf files (criminal records) into R table

2009-10-15 Thread Marc Schwartz

On Oct 15, 2009, at 3:43 AM, Biedermann, Jürgen wrote:


[original message snipped]

You don't indicate the OS you are on, but you will want to get hold
of 'pdftotext', a command-line application that extracts the textual
content from PDF files. On most Linux distributions it is already
installed, but for Windows and OS X you will likely need to Google for
it.


The basic approach is to loop over each PDF file, use pdftotext to get  
the text content and dump it into a regular text file. That file can  
then be read into R using ?readLines.


This can all be done within R using the ?system function. Get the names
of the PDF files in a given folder by using ?list.files with a
"\\.pdf" or "\\.PDF" search pattern. Then ?paste together the full
command using a prefix along the lines of "pdftotext -layout -nopgbrk",
presuming that the pdftotext command is in your $PATH. The suffix to be
paste()d will be the name of the input PDF file and the name of the
output text file. So you end up with a command line character vector
along the lines of:


  pdftotext -layout -nopgbrk x.pdf x.txt

where the x's are the specific file basenames. Review the pdftotext
options to understand what is being done and whether you need to
modify them for your particular files.
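
The steps above can be sketched in R roughly as follows. This is only a
minimal sketch, not a tested solution for the actual records: the function
names (txt_name, build_pdftotext_cmds, read_pdf_texts) are made up for
illustration, the use of shQuote() to protect file names containing spaces
is an addition, and read_pdf_texts() assumes that pdftotext is installed
and on the PATH.

```r
# Derive the output .txt name from a .pdf/.PDF file name.
txt_name <- function(pdf_files) {
  sub("\\.pdf$", ".txt", pdf_files, ignore.case = TRUE)
}

# Build one "pdftotext -layout -nopgbrk in.pdf out.txt" command per file.
# shQuote() protects file names that contain spaces.
build_pdftotext_cmds <- function(pdf_files) {
  if (length(pdf_files) == 0) return(character(0))
  paste("pdftotext -layout -nopgbrk",
        shQuote(pdf_files), shQuote(txt_name(pdf_files)))
}

# Convert every PDF in `dir` and read the resulting text files.
# Returns a list with one character vector (the lines) per PDF.
# Assumption: pdftotext is installed and on the PATH.
read_pdf_texts <- function(dir = ".") {
  pdf_files <- list.files(dir, pattern = "\\.pdf$",
                          ignore.case = TRUE, full.names = TRUE)
  for (cmd in build_pdftotext_cmds(pdf_files)) system(cmd)
  texts <- lapply(txt_name(pdf_files), readLines, warn = FALSE)
  names(texts) <- basename(pdf_files)
  texts
}

# Inspect the commands without running them.
cmds <- build_pdftotext_cmds(c("x.pdf", "y.PDF"))
print(cmds)
```

Building the command strings separately from running them makes it easy
to check them by eye before any file is converted.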


Once you have the data in R for each file, you will then need to
process the content line by line, looking for the keywords that are
associated with the content you require. Using ?grep is perhaps the
easiest way to accomplish that. You can then use ?gsub to replace/strip
the keywords, leaving you with the data only, for each line. For
multi-line scenarios, you will need to keep track of where the keyword
for the first line is and then look for the subsequent keyword, or
perhaps a blank line, to know when to stop aggregating the data for
that initial keyword.
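
As a rough illustration of this keyword approach, the sketch below turns
the lines of one record into one row per entry. Everything specific in it
is an assumption: the keyword patterns and column names are guesses based
on the example record in the original question, the sample lines are
invented, and parse_record() is a hypothetical helper. It takes only the
first line matching each keyword, does not handle values that continue
over several lines, and collects unmatched lines in a Not.allocated
column for manual checking.

```r
# Hypothetical keyword patterns; column name = regular expression.
keywords <- c(
  In.Force        = "Be in force since:",
  DateOffen       = "Date of offen[cs]e:",
  Offence         = "Elements of the offence:",
  Sentence        = "Sentenced to",
  Probation.until = "on probation until the"
)

parse_record <- function(lines, keywords) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  person <- sub("^Anonymous person number:\\s*", "",
                grep("^Anonymous person number:", lines, value = TRUE)[1])
  # Entry headers look like "1. xx.xx.1902  -City-".
  starts <- grep("^[0-9]+\\.\\s", lines)
  ends <- c(starts[-1] - 1, length(lines))
  rows <- lapply(seq_along(starts), function(i) {
    header <- lines[starts[i]]
    body <- if (ends[i] > starts[i]) lines[(starts[i] + 1):ends[i]] else character(0)
    row <- list(
      Per.Numb    = person,
      EntryNumber = as.integer(sub("^([0-9]+)\\..*$", "\\1", header)),
      Judg.Date   = sub("^[0-9]+\\.\\s+(\\S+).*$", "\\1", header)
    )
    used <- rep(FALSE, length(body))
    for (col in names(keywords)) {
      hit <- grep(keywords[[col]], body)
      if (length(hit) > 0) {
        # Strip everything up to and including the keyword.
        row[[col]] <- trimws(sub(paste0("^.*", keywords[[col]]), "", body[hit[1]]))
        used[hit[1]] <- TRUE
      } else {
        row[[col]] <- NA_character_
      }
    }
    # Lines that matched no keyword are kept for manual checking.
    row$Not.allocated <- if (any(!used)) paste(body[!used], collapse = " | ") else NA_character_
    as.data.frame(row, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Invented sample data in the layout of the example record.
sample_lines <- c(
  "Anonymous person number: 17",
  "Entries in the register",
  "1. 01.02.1902  -City-",
  "   Be in force since: 05.03.1902",
  "   Date of offense:12.11.1901",
  "   Elements of the offence: Theft",
  "   Sentenced to 5 years imprisonment",
  "   Rest of sentence sentenced on probation until the 01.01.1905",
  "   Some misspelt line",
  "",
  "2. 03.04.1910",
  "   Be in force since: 01.05.1910"
)
tab <- parse_record(sample_lines, keywords)
print(tab)
```

The data frames for the individual files can then be combined with
rbind() to give one table for all persons.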


It then becomes a matter of reorganizing the content that you need  
into the format you require for subsequent work.


I have not looked for 'text processing' related packages on CRAN, so  
you may wish to look there 

Re: [R] Complex? import of pdf files (criminal records) into R table

2009-10-15 Thread Barry Rowlingson
On Thu, Oct 15, 2009 at 3:28 PM, Marc Schwartz marc_schwa...@me.com wrote:
> On Oct 15, 2009, at 3:43 AM, Biedermann, Jürgen wrote:
>
> You don't indicate the OS you are on, but you will want to get a hold of
> 'pdftotext', which is a command line application that can extract the
> textual content from the PDF files.

 That's assuming the text is in the PDF as a text object. If it's a
scan of a paper document the chances are that all you have is an
image, in which case you need to do OCR (optical character
recognition) or get someone to type it all in again.
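
One crude way to spot such files, offered only as a heuristic sketch: if
the text extracted from a PDF contains almost no non-whitespace
characters, the file is probably a scanned image. The helper name
looks_like_scan and the threshold of 20 characters are arbitrary
assumptions, not anything the extraction tool provides.

```r
# TRUE if the extracted lines contain fewer than `min_chars`
# non-whitespace characters in total (likely an image-only PDF).
looks_like_scan <- function(lines, min_chars = 20) {
  sum(nchar(gsub("[[:space:]]", "", lines))) < min_chars
}

print(looks_like_scan(c("", "   ")))
print(looks_like_scan("Anonymous person number: 17 and more text"))
```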

 Even if you can get all the text out with pdftotext, R might not be the
right tool for the job - I'd do this kind of text processing and
matching job in Python (and before Python, I'd have used Perl). But if
all you have is a wRench...

Barry



Re: [R] Complex? import of pdf files (criminal records) into R table

2009-10-15 Thread Marc Schwartz

On Oct 15, 2009, at 10:10 AM, Barry Rowlingson wrote:

> On Thu, Oct 15, 2009 at 3:28 PM, Marc Schwartz
> marc_schwa...@me.com wrote:
>> On Oct 15, 2009, at 3:43 AM, Biedermann, Jürgen wrote:
>>
>> You don't indicate the OS you are on, but you will want to get a
>> hold of 'pdftotext', which is a command line application that can
>> extract the textual content from the PDF files.
>
> That's assuming the text is in the PDF as a text object. If it's a
> scan of a paper document the chances are that all you have is an
> image, in which case you need to do OCR (optical character
> recognition) or get someone to type it all in again.


Good point... a scanned image would certainly complicate matters. Even
with OCR, you introduce the potential for error in the translation
of the image to text and risk formatting issues, which can lead to
inconsistencies in page layouts.


Cheers,

Marc
