[R] regexpr and parsing question
The main problem I am trying to solve is this: I am importing a tab-delimited file whose first line contains only one column, which is a descriptor of the form

    col_1 col_2 col_3

i.e. the colnames are not tab-delimited but are separated by whitespace. I would like to parse this first line so that it becomes the colnames of the rest of the file, which I am reading into R using read.delim(). The file is so huge that I must do this in R.

My first question is this: What is the best way to accomplish what I want to do?

My other questions revolve around some failed attempts on my part to solve the problem on my own using regular expressions. I thought that perhaps I could change the first line to c("col_1", "col_2", "col_3") using gsub. I was having trouble figuring out how R uses the backslash character, because I know that sometimes the backslash one would use in Perl needs to be a double backslash in R.

Here is a sample of what I tried and what I got:

    > a <- "col_1 col_2 col_3"
    > gsub("\\s", " ", a)
    [1] "col_1 col_2 col_3"
    > gsub("\\s", "\\s", a)
    [1] "col_1scol_2scol_3"

As you can see, it looks like R is taking a regular expression for pattern, but not taking it for replacement. Why is this?

Assuming that I did want to solve my original problem with gsub and then turn the string into an R object, how would I get gsub to return c("col_1", "col_2", "col_3") using my original string?

Finally, is there a way to declare a string as a regular expression so that R sees it the same way other languages, such as Perl, do, i.e. make the backslash be interpreted the same way? For someone who is just learning regular expressions as I am, it is very frustrating to read about them in references and then have to translate what I've learned into R syntax. I was thinking that instead of enclosing the string in "", one could use THIS.IS.A.REGULAR.EXPRESSION(), similar to the way we use I() in formulae.

These are a bunch of questions, but obviously I have a lot to learn!

Thanks,
Mark

Mark W. Kimpel MD
(317) 490-5129 Work, Mobile
(317) 663-0513 Home (no voice mail please)
1-(317)-536-2730 FAX

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] regexpr and parsing question
Both spaces and tabs are whitespace, so this should be good enough (unless you can have empty fields):

    read.table("myfile.dat", header = TRUE)

See the sep= argument in ?read.table.

Although I don't think you really need this, here are some regular expressions for processing a header into the form you asked for. The first line places quotes around the names, the second one inserts commas and the last one adds c( and ).

    s <- gsub('(\\S+)', '"\\1"', 'col1 col2 col3')
    s <- gsub("(\\S+) ", "\\1, ", s)
    sub("(.*)", "c(\\1)", s)

On 1/30/07, Kimpel, Mark William wrote:
> The main problem I am trying to solve is this: I am importing a tab-delimited file whose first line contains only one column, which is a descriptor of the form col_1 col_2 col_3, i.e. the colnames are not tab-delimited but are separated by whitespace.
> [...]
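As a rough check of the read.table() suggestion in this reply, here is a minimal sketch. The temporary file and its contents are made up for illustration; they are assumptions, not the poster's data. It shows that the default sep = "" treats both the spaces in the header and the tabs in the body as field separators.

```r
# Hypothetical sample file: space-separated header, tab-separated body.
tmp <- tempfile(fileext = ".dat")
writeLines(c("col_1 col_2 col_3",
             "1\t2.5\ta",
             "2\t3.5\tb"), tmp)

# Default sep = "" splits on any run of spaces or tabs, so the header
# and the data lines are parsed with the same rule.
df <- read.table(tmp, header = TRUE)
print(names(df))
print(dim(df))

unlink(tmp)
```

As the reply notes, this only holds when no field is empty and, by the same logic, when no field contains a space, because a run of whitespace counts as a single separator.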
Re: [R] regexpr and parsing question
On Tue, 2007-01-30 at 17:23 -0500, Kimpel, Mark William wrote:
> The main problem I am trying to solve is this: I am importing a tab-delimited file whose first line contains only one column, which is a descriptor of the form col_1 col_2 col_3, i.e. the colnames are not tab-delimited but are separated by whitespace. I would like to parse this first line so that it becomes the colnames of the rest of the file, which I am reading into R using read.delim(). The file is so huge that I must do this in R.
>
> My first question is this: What is the best way to accomplish what I want to do?

Mark,

The first thing that comes to mind is a two-pass approach on the file:

First pass: (using an example file with your first line)

    # Get the first line into a vector to set the colnames for the DF
    # during the second pass
    ColNames <- unlist(read.table("test.txt", nrows = 1, as.is = TRUE))

    > str(ColNames)
     Named chr [1:3] "col_1" "col_2" "col_3"
     - attr(*, "names")= chr [1:3] "V1" "V2" "V3"

Second pass:

    # Now read the rest of the file, skipping the first line.
    # header = FALSE is needed because read.delim() defaults to header = TRUE
    # and would otherwise consume the first data row as a header.
    DF <- read.delim("test.txt", skip = 1, header = FALSE, col.names = ColNames)

I believe that should get you the full data set and set the colnames based upon the first line. This should pretty much obviate the need for everything below here.

> My other questions revolve around some failed attempts on my part to solve the problem on my own using regular expressions. I thought that perhaps I could change the first line to c("col_1", "col_2", "col_3") using gsub. I was having trouble figuring out how R uses the backslash character, because I know that sometimes the backslash one would use in Perl needs to be a double backslash in R.

You would not want to change the first line as you have it above, as it would not be parsed properly using read.table() family functions.
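The two-pass idea described in this message can be sketched end to end as follows. This is a variation under stated assumptions rather than the poster's exact code: the file contents are invented, scan() is swapped in for read.table() in the first pass, and header = FALSE is passed explicitly in the second pass so that no data row is treated as a header.

```r
# Hypothetical sample file: space-separated header, tab-separated body,
# including an empty field and a field that contains a space.
tmp <- tempfile(fileext = ".txt")
writeLines(c("col_1 col_2 col_3",
             "1\t\tx y",
             "2\t3.5\tz"), tmp)

# First pass: read only the first line and split it on whitespace.
col_names <- scan(tmp, what = "", nlines = 1, quiet = TRUE)

# Second pass: skip the header line and read the tab-delimited body.
df <- read.delim(tmp, skip = 1, header = FALSE, col.names = col_names)
print(df)

unlink(tmp)
```

Because the body is split strictly on tabs here, the empty field becomes NA and the embedded space survives, which a whitespace-separated read of the whole file would not handle.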
> Here is a sample of what I tried and what I got:
>
>     > a <- "col_1 col_2 col_3"
>     > gsub("\\s", " ", a)
>     [1] "col_1 col_2 col_3"
>     > gsub("\\s", "\\s", a)
>     [1] "col_1scol_2scol_3"
>
> As you can see, it looks like R is taking a regular expression for pattern, but not taking it for replacement. Why is this?

There are various settings for how regex are interpreted by/within R. See ?grep and note the various arguments to the functions there and how they impact R's behavior here.

Also, note that there is a difference (to further complicate your life...) between the characters that R displays by default using print() and how they are displayed using cat(). See below.

    > a
    [1] "col_1 col_2 col_3"

    > gsub(" ", ", ", a)
    [1] "col_1, col_2, col_3"

or to get you to your vector statement above. Note the result here:

    > paste("c(\"", gsub(" ", "\", \"", a), "\")", sep = "")
    [1] "c(\"col_1\", \"col_2\", \"col_3\")"

Now see how it displays when the escaped double quote chars are interpreted properly using cat():

    > cat(paste("c(\"", gsub(" ", "\", \"", a), "\")", sep = ""), "\n")
    c("col_1", "col_2", "col_3")

> Assuming that I did want to solve my original problem with gsub and then turn the string into an R object, how would I get gsub to return c("col_1", "col_2", "col_3") using my original string?

Again, note the two pass solution above. It's easier, unless you would want to consider using awk/sed from a CLI, which I generally avoid at all costs...

> Finally, is there a way to declare a string as a regular expression so that R sees it the same way other languages, such as Perl, do, i.e. make the backslash be interpreted the same way? For someone who is just learning regular expressions as I am, it is very frustrating to read about them in references and then have to translate what I've learned into R syntax. I was thinking that instead of enclosing the string in "", one could use THIS.IS.A.REGULAR.EXPRESSION(), similar to the way we use I() in formulae.
Part of the challenge is noting the different behaviors of regex within R and how that behavior is affected by the aforementioned arguments. Also note how output is displayed within R relative to the interpretation of escaped characters, as seen above.

> These are a bunch of questions, but obviously I have a lot to learn!
>
> Thanks,
> Mark

HTH,

Marc Schwartz
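The backslash behaviour discussed in this message can be illustrated with a short sketch. It assumes the same string `a` as in the question; the raw-string syntax at the end requires R 4.0.0 or later, which postdates this thread, so treat that part as an aside for newer versions.

```r
a <- "col_1 col_2 col_3"

# Layer 1: the R parser. "\\s" in source code is the two characters \s
# by the time gsub() sees it.
nchar("\\s")        # 2
cat("\\s", "\n")    # cat() shows the string as the regex engine receives it

# Layer 2: gsub(). The pattern is a regular expression, but the replacement
# is not: there a backslash is only special for back-references such as \1,
# so \s in the replacement comes out as a plain "s".
res_s <- gsub("\\s", "\\s", a)

# Back-references are the part of the replacement that is interpreted.
res_ref <- gsub("(\\S+)", "<\\1>", a)

# R >= 4.0.0 only: a raw string avoids the doubled backslash in the pattern.
raw_pat <- r"(\s)"
same_pattern <- identical(raw_pat, "\\s")
res_raw <- gsub(raw_pat, ", ", a)
```

On versions with raw strings, `r"(...)"` comes close to the THIS.IS.A.REGULAR.EXPRESSION() idea from the question: the pattern can be typed as it would appear in a regex reference.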
Re: [R] regexpr and parsing question
And here is an alternative to the regular expressions (although again I don't think you really need any of this):

    > capture.output(dput(strsplit("col1 col2 col3", " ")[[1]]))
    [1] "c(\"col1\", \"col2\", \"col3\")"

On 1/30/07, Gabor Grothendieck wrote:
> Both spaces and tabs are whitespace, so this should be good enough (unless you can have empty fields):
>
>     read.table("myfile.dat", header = TRUE)
>
> [...]
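For completeness, a small sketch of the strsplit() route from this last reply. The header string is the one from the original question; splitting on "\\s+" instead of a single space is an assumption made here so that tabs or repeated spaces would also be handled. It shows that the split result is already a usable character vector, and that the c(...) text form can be turned back into an R object if that is really wanted.

```r
header <- "col_1 col_2 col_3"

# Splitting on runs of whitespace gives the column names directly,
# with no need to build a c(...) string first.
col_names <- strsplit(header, "\\s+")[[1]]

# The text form asked for in the question, produced by deparsing the vector.
as_text <- capture.output(dput(col_names))
cat(as_text, "\n")

# Turning that text back into an R object.
round_trip <- eval(parse(text = as_text))
```

In practice `col_names` can be passed straight to the col.names argument of read.delim(), so the text form and the round trip are only needed if the string itself is the goal.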