[R] Getting data from a PDF-file into R
Hello,

I have around 200 PDF documents containing data I want organized in R as a data frame.

The PDF documents look like this: http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver.jpeg
or like this: http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver%2B2.jpeg

I want to pull out the data in the coloured boxes so that it becomes organized like this (just in R instead of Excel): http://www.nabble.com/file/p21667074/PRRS-billede%2Bexcel.jpeg

The 0s and 1s encode the status on a particular date. PRRS-neg is represented by a 0 in the columns PRRS-VAC and PRRS-DK. Likewise, PRRS-pos VAC (or Vac) is represented by a 1 in the column PRRS-VAC, and PRRS-pos DK (or DK) by a 1 in the column PRRS-DK. With sanVAC there should be a 1 in the column VACsan, and with sanDK a 1 in the column DKsan.

The first date for each CHR-nr should be either the earliest date in the red box (as in the first picture) or the date preceded by the word "før" (Danish for "before"), as in the second picture.

All 200 PDF documents look like the ones in the pictures, each representing a different CHR-nr.

Hope you can help me.

--
View this message in context: http://www.nabble.com/Getting-data-from-a-PDF-file-into-R-tp21667074p21667074.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Getting data from a PDF-file into R
joe1985 wrote:
> I have around 200 PDF documents containing data I want organized in R as a data frame.

Not on the basis of .jpeg files, I think. We'd need some indication of what the PDF looks like inside.

There's a tool called pdftotext, which might do something for you, IF you can figure out reliably where your data begin and end.

--
   O__  Peter Dalgaard                  Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)                      FAX: (+45) 35327907
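The pdftotext suggestion could be tried from within R along the following lines. This is only a sketch under assumptions: it presumes the `pdftotext` command-line tool is installed and on the PATH, and the function names `pdf_to_lines` and `extract_block`, as well as the start and end patterns, are hypothetical placeholders that would have to be adapted once the converted text has been inspected.

```r
# Convert one PDF to text with the external pdftotext tool and return its lines.
# Assumes pdftotext is installed and reachable on the PATH.
pdf_to_lines <- function(pdf_file) {
  txt_file <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_file))
  # -layout asks pdftotext to preserve the physical layout of the page
  status <- system2("pdftotext",
                    c("-layout", shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) stop("pdftotext failed for ", pdf_file)
  readLines(txt_file, warn = FALSE, encoding = "UTF-8")
}

# Keep only the lines strictly between the first line matching start_pattern
# and the next line matching end_pattern. Both patterns are hypothetical.
extract_block <- function(lines, start_pattern, end_pattern) {
  start <- grep(start_pattern, lines)[1]
  if (is.na(start)) return(character(0))
  after <- seq_along(lines) > start
  end <- which(after & grepl(end_pattern, lines))[1]
  if (is.na(end)) end <- length(lines) + 1   # no end marker: keep the rest
  if (end - start < 2) return(character(0))
  lines[(start + 1):(end - 1)]
}

# Possible usage for a folder of PDFs ("pdfs" is a made-up directory name):
# files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# all_lines <- lapply(files, pdf_to_lines)
```

Whether this is workable depends entirely on how regular the converted text turns out to be, which is the caveat raised in the reply above.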
Re: [R] Getting data from a PDF-file into R
On Mon, Jan 26, 2009 at 9:40 AM, Peter Dalgaard p.dalga...@biostat.ku.dk wrote:
> There's a tool called pdftotext, which might do something for you, IF you can figure out reliably where your data begin and end.

An alternative is to outsource the problem. You can get very reasonable data entry quotes from sites like http://www.elance.com/, and depending on how much you value your time this might end up being a much cheaper option than figuring out how to do it programmatically (but not as intellectually satisfying).

Hadley

--
http://had.co.nz/
Re: [R] Getting data from a PDF-file into R
You can convert the PDF to text, then manipulate the output to read only the data. Linux has the pdftotext command; you can also download the xpdf zip, which contains it.

Best

On 1/26/09, joe1985 johan...@dsr.life.ku.dk wrote:
> I have around 200 PDF documents containing data I want organized in R as a data frame.

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40 S 49° 16' 22 O
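Once the text has been extracted, the "manipulate the output" step might look roughly like the sketch below, which follows the coding rules given in the original question. Several things here are assumptions rather than facts about the real files: that each data line contains a date written as dd-mm-yyyy followed by a status label, that any column not triggered by a label is 0, and the names `status_to_flags` and `lines_to_frame`. The column names use underscores because hyphens are not syntactic in R names. The rule for choosing the first date of each CHR-nr is not handled.

```r
# Turn status labels into the 0/1 columns described in the original post.
# Assumption: a column not triggered by a label is 0, so "PRRS-neg"
# gives 0 in both PRRS_VAC and PRRS_DK.
status_to_flags <- function(status) {
  data.frame(
    # \\b keeps "VAC"/"DK" inside "sanVAC"/"sanDK" from matching here
    PRRS_VAC = as.integer(grepl("\\b(VAC|Vac)\\b", status, perl = TRUE)),
    PRRS_DK  = as.integer(grepl("\\bDK\\b", status, perl = TRUE)),
    VACsan   = as.integer(grepl("sanVAC", status, fixed = TRUE)),
    DKsan    = as.integer(grepl("sanDK", status, fixed = TRUE))
  )
}

# Build a data frame for one CHR-nr from lines of converted text.
# Assumption: data lines look like "dd-mm-yyyy <status label>";
# lines without such a date are skipped.
lines_to_frame <- function(lines, chr_nr) {
  date_pattern <- "[0-9]{2}-[0-9]{2}-[0-9]{4}"
  lines <- lines[grepl(date_pattern, lines)]
  dates <- regmatches(lines, regexpr(date_pattern, lines))
  # Everything after the first date on the line is treated as the status
  status <- trimws(sub(paste0("^.*?", date_pattern), "", lines, perl = TRUE))
  out <- cbind(
    data.frame(CHR_nr = rep(chr_nr, length(lines)),
               date = as.Date(dates, format = "%d-%m-%Y")),
    status_to_flags(status)
  )
  out <- out[order(out$date), ]
  rownames(out) <- NULL
  out
}

# Made-up example lines, not taken from the real documents
example_lines <- c("CHR-nr 12345",
                   "05-03-2008 PRRS-pos VAC",
                   "01-02-2007 PRRS-neg",
                   "10-06-2008 sanDK")
result <- lines_to_frame(example_lines, 12345)
```

The per-file results could then be combined with `do.call(rbind, ...)` over all 200 documents.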
Re: [R] Getting data from a PDF-file into R
Peter Dalgaard wrote:
> Not on the basis of .jpeg files, I think. We'd need some indication of what the PDF looks like inside.

Thank you for your quick response.

Here they are as text files:
http://www.nabble.com/file/p21680833/Foersom%2B-%2B688.txt Foersom+-+688.txt
http://www.nabble.com/file/p21680833/M%25C3%2598LLEVANG%2B602%2B.txt MØLLEVANG+602+.txt

--
View this message in context: http://www.nabble.com/Getting-data-from-a-PDF-file-into-R-tp21667074p21680833.html