[R] Getting data from a PDF-file into R

2009-01-26 Thread joe1985

Hello

I have around 200 PDF documents containing data that I want organized in R
as a data frame. The PDF documents look like this:

  http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver.jpeg 

or like this:

http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver%2B2.jpeg 

I want to pull out the data in the coloured boxes so that it is organized
like this (just in R instead of Excel):


http://www.nabble.com/file/p21667074/PRRS-billede%2Bexcel.jpeg 

The 0s and 1s encode the status on a particular date. PRRS-neg is
represented by a 0 in the columns PRRS-VAC and PRRS-DK. Likewise, PRRS-pos
VAC (or Vac) is represented by a 1 in the column PRRS-VAC, and PRRS-pos DK
(or DK) by a 1 in the column PRRS-DK. For sanVAC there should be a 1 in the
column VACsan, and for sanDK a 1 in the column DKsan. The first date for
each CHR-nr should be either the earliest date at the red box (as in the
first picture) or the date preceded by the word "før" (Danish for "before"),
as in the second picture. All 200 PDF documents look like the ones in the
pictures, each representing a different CHR-nr.


I hope you can help me.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Getting data from a PDF-file into R

2009-01-26 Thread Peter Dalgaard

Not on the basis of .jpeg files, I think. We'd need some indication of
what the PDF looks like inside.  There's a tool called pdftotext, which
might do something for you, IF you can figure out reliably where your
data begin and end.
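
As a rough illustration of the pdftotext route, here is a minimal R sketch.
It assumes that pdftotext (from xpdf or poppler) is installed and on the
PATH; the helper name pdf_to_lines and the folder name "pdfs" are
hypothetical, and whether the -layout option gives usable output for these
particular documents is something you would have to check.

```r
# Convert one PDF to text with the external pdftotext tool and read the result.
# Assumption: pdftotext (xpdf or poppler) is installed and on the PATH.
pdf_to_lines <- function(pdf, layout = TRUE) {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  txt <- tempfile(fileext = ".txt")
  on.exit(unlink(txt))
  # -layout asks pdftotext to keep the physical layout of the page
  args <- c(if (layout) "-layout", shQuote(pdf), shQuote(txt))
  status <- system2("pdftotext", args)
  if (status != 0) stop("pdftotext failed for ", pdf)
  readLines(txt, warn = FALSE)
}

# Hypothetical usage: "pdfs" is an assumed folder holding the documents.
# files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# pages <- lapply(files, pdf_to_lines)
```

Each element of pages would then be a character vector of text lines for one
document, which is the point where you need to work out where the data begin
and end.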

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907



Re: [R] Getting data from a PDF-file into R

2009-01-26 Thread hadley wickham

An alternative is to outsource the problem.  You can get very
reasonable data entry quotes from sites like http://www.elance.com/,
and depending on how much you value your time this might end up being
a much cheaper option than figuring out how to do it programmatically
(but not as intellectually satisfying).

Hadley

-- 
http://had.co.nz/



Re: [R] Getting data from a PDF-file into R

2009-01-26 Thread Henrique Dallazuanna
You can convert the PDF to text and then process the output so that only the
data are read.

On Linux there is the pdftotext tool; on other systems you can download the
xpdf zip, which contains it.
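
A minimal sketch of the "manipulate the output" step in R follows. The
sample lines are hypothetical stand-ins for pdftotext output, and the date
format dd-mm-yyyy, the one-date-per-line layout, and the function names are
all assumptions; the real text files would need their own patterns. The 0/1
coding follows the description in the original post.

```r
# Hypothetical lines standing in for pdftotext output; the real layout may differ.
lines <- c(
  "CHR-nr 12345",
  "01-02-2005 PRRS-neg",
  "15-06-2006 PRRS-pos VAC",
  "20-09-2007 PRRS-pos DK sanVAC",
  "some line without a date"
)

# TRUE for each line that contains one of the given whitespace-separated tokens
has_token <- function(x, tokens) {
  vapply(strsplit(x, "[[:space:]]+"),
         function(tk) any(tk %in% tokens), logical(1))
}

parse_status <- function(lines) {
  date_pat <- "[0-9]{2}-[0-9]{2}-[0-9]{4}"  # assumed date format: dd-mm-yyyy
  x <- lines[grepl(date_pat, lines)]        # keep only lines carrying a date
  data.frame(
    date     = as.Date(regmatches(x, regexpr(date_pat, x)), format = "%d-%m-%Y"),
    PRRS_VAC = as.integer(has_token(x, c("VAC", "Vac"))),
    PRRS_DK  = as.integer(has_token(x, "DK")),
    VACsan   = as.integer(has_token(x, "sanVAC")),
    DKsan    = as.integer(has_token(x, "sanDK"))
  )
}

status <- parse_status(lines)
print(status)
```

Matching on whole tokens rather than substrings keeps "sanDK" from also
counting as "DK". A PRRS-neg line simply ends up with 0 in both PRRS columns.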

Best


-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40 S 49° 16' 22 O



Re: [R] Getting data from a PDF-file into R

2009-01-26 Thread joe1985




Thank you for your quick response.

Here they are as text files:

http://www.nabble.com/file/p21680833/Foersom%2B-%2B688.txt Foersom+-+688.txt 

http://www.nabble.com/file/p21680833/M%25C3%2598LLEVANG%2B602%2B.txt
M%C3%98LLEVANG+602+.txt 