[R] regexpr and parsing question

2007-01-30 Thread Kimpel, Mark William
The main problem I am trying to solve is this:

I am importing a tab delimited file whose first line contains only one
column, which is a descriptor of the form col_1 col_2 col_3, i.e. the
colnames are not tab delimited but are separated by whitespace. I would
like to parse this first line so that it becomes the colnames
of the rest of the file, which I am reading into R using read.delim().
The file is so huge that I must do this in R.

My first question is this: What is the best way to accomplish what I
want to do?

My other questions revolve around some failed attempts on my part to
solve the problem on my own using regular expressions. I thought that
perhaps I could change the first line to c("col_1", "col_2", "col_3")
using gsub. I was having trouble figuring out how R uses the backslash
character because I know that sometimes the backslash one would use in
Perl needs to be a double backslash in R.

Here is a sample of what I tried and what I got:

> a <- "col_1 col_2 col_3"

> gsub("\\s", " " , a)

[1] "col_1 col_2 col_3"

> gsub("\\s", "\\s" , a)

[1] "col_1scol_2scol_3"

As you can see, it looks like R is taking a regular expression for
pattern, but not taking it for replacement. Why is this?

Assuming that I did want to solve my original problem with gsub and then
turn the string into an R object, how would I get gsub to return
c("col_1", "col_2", "col_3") using my original string?

Finally, is there a way to declare a string as a regular expression so
that R sees it the same way other languages, such as Perl do, i.e. make
the backslash be interpreted the same way? For someone who is just
learning regular expressions as I am, it is very frustrating to read
about them in references and then have to translate what I've learned
into R syntax. I was thinking that instead of enclosing the string in
"", one could use THIS.IS.A.REGULAR.EXPRESSION(), similar to the way we
use I() in formulae.

These are a bunch of questions, but obviously I have a lot to learn!

Thanks,

Mark

Mark W. Kimpel MD 

 

(317) 490-5129 Work,  Mobile

 

(317) 663-0513 Home (no voice mail please)

1-(317)-536-2730 FAX

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] regexpr and parsing question

2007-01-30 Thread Gabor Grothendieck
Both spaces and tabs are whitespace so this
should be good enough (unless you can
have empty fields):

read.table("myfile.dat", header = TRUE)

See the sep= argument in ?read.table .
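
As a minimal sketch of this suggestion (the temporary file and its contents are made up for illustration, not taken from the original post): with the default sep = "", read.table() splits on any run of spaces or tabs, so a space-separated header and tab-separated rows parse together as long as no field is empty.

```r
# Hypothetical test file: space-separated header, tab-separated data rows
tmp <- tempfile(fileext = ".dat")
writeLines(c("col_1 col_2 col_3", "1\t2\t3", "4\t5\t6"), tmp)

# sep = "" (the default) treats any run of spaces or tabs as one separator
df <- read.table(tmp, header = TRUE)
print(df)

unlink(tmp)  # remove the temporary file
```

If some fields can be empty, the default separator would collapse adjacent tabs, and the file would need sep = "\t" for the data rows instead.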

Although I don't think you really need this, here are
some regular expressions for processing a header
into the form you asked for.  The first line places
quotes around the names, the second one inserts
commas and the last one adds c( and ).

s <- gsub('(\\S+)', '"\\1"', 'col1 col2 col3')
s <- gsub("(\\S+) ", "\\1, ", s)
sub("(.*)", "c(\\1)", s)




Re: [R] regexpr and parsing question

2007-01-30 Thread Marc Schwartz
On Tue, 2007-01-30 at 17:23 -0500, Kimpel, Mark William wrote:
> The main problem I am trying to solve is this:
> 
> I am importing a tab delimited file whose first line contains only one
> column, which is a descriptor of the form col_1 col_2 col_3, i.e. the
> colnames are not tab delimited but are separated by whitespace. I would
> like to parse this first line so that it becomes the colnames
> of the rest of the file, which I am reading into R using read.delim().
> The file is so huge that I must do this in R.
> 
> My first question is this: What is the best way to accomplish what I
> want to do?

Mark,

The first thing that comes to mind is a two pass approach on the file:

First pass: (using example file with your first line)

# Get the first line into a vector to set the colnames for the DF
# during the second pass
ColNames <- unlist(read.table("test.txt", nrows = 1, as.is = TRUE))

> str(ColNames)
 Named chr [1:3] "col_1" "col_2" "col_3"
 - attr(*, "names")= chr [1:3] "V1" "V2" "V3"


Second pass:

# Now read the rest of the file, skipping the first line
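# header = FALSE because read.delim() defaults to header = TRUE, which
# would otherwise consume the first data row as a header
DF <- read.delim("test.txt", skip = 1, header = FALSE, col.names = ColNames)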


I believe that should get you the full data set and set the colnames
based upon the first line. This should pretty much obviate the need for
everything below here.
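
A self-contained sketch of the same two-pass idea, for anyone who wants to try it without the real file. The temporary file, its contents, and the names tmp, col_names and df are assumptions made for illustration; scan() is used here for the first pass simply because it returns a plain character vector.

```r
# Hypothetical test file: space-separated header, tab-separated data rows,
# with one empty field in the first data row
tmp <- tempfile(fileext = ".txt")
writeLines(c("col_1 col_2 col_3", "1\t\t3", "4\t5\t6"), tmp)

# First pass: read only the header line and split it on whitespace
col_names <- scan(tmp, what = "", nlines = 1, quiet = TRUE)

# Second pass: skip the header line and read the tab-delimited body.
# header = FALSE is passed explicitly since read.delim() defaults to
# header = TRUE.
df <- read.delim(tmp, skip = 1, header = FALSE, col.names = col_names)
print(df)

unlink(tmp)  # remove the temporary file
```

Because the second pass splits on tabs only, an empty field is kept in its column and read as NA rather than shifting the rest of the row.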

> My other questions revolve around some failed attempts on my part to
> solve the problem on my own using regular expressions. I thought that
> perhaps I could change the first line to c("col_1", "col_2", "col_3")
> using gsub. I was having trouble figuring out how R uses the backslash
> character because I know that sometimes the backslash one would use in
> Perl needs to be a double backslash in R.

You would not want to change the first line as you have it above, as it
would not be parsed properly using read.table() family functions.

> Here is a sample of what I tried and what I got:
> 
> > a <- "col_1 col_2 col_3"
> 
> > gsub("\\s", " " , a)
> 
> [1] "col_1 col_2 col_3"
> 
> > gsub("\\s", "\\s" , a)
> 
> [1] "col_1scol_2scol_3"
> 
> As you can see, it looks like R is taking a regular expression for
> pattern, but not taking it for replacement. Why is this?

There are various settings for how regex are interpreted by/within R.
See ?grep and note the various arguments to the functions there and how
they impact R's behavior here.
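
To make this concrete, here is a small sketch (the variable names are illustrative only). In gsub() the pattern argument is a regular expression, but the replacement is ordinary text in which only the backreferences \\1 to \\9 are special. A "\\s" in the replacement therefore does not mean whitespace and comes out as a plain s.

```r
a <- "col_1 col_2 col_3"

# Pattern: "\\s" is the R string for the regex \s (any whitespace)
with_commas <- gsub("\\s", ", ", a)
print(with_commas)   # [1] "col_1, col_2, col_3"

# Replacement: "\\s" is not a regex here, the backslash is simply dropped
literal_s <- gsub("\\s", "\\s", a)
print(literal_s)     # [1] "col_1scol_2scol_3"

# Replacement: backreferences such as \\1 are the special case
wrapped <- gsub("(\\S+)", "<\\1>", a)
print(wrapped)       # [1] "<col_1> <col_2> <col_3>"

# fixed = TRUE turns off regex matching of the pattern altogether
dots <- gsub(".", "-", "a.b", fixed = TRUE)
print(dots)          # [1] "a-b"
```

The perl = TRUE argument documented in ?grep selects Perl-compatible matching for the pattern; it does not turn the replacement into a regular expression.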

Also, note that there is a difference (to further complicate your
life...) between the characters that R displays by default using print()
and how they are displayed using cat(). See below.

> a
[1] "col_1 col_2 col_3"

> gsub(" ", ", ", a)
[1] "col_1, col_2, col_3"

or to get you to your vector statement above:

Note the result here:

> paste("c(\"", gsub(" ", "\", \"", a), "\")", sep = "")
[1] "c(\"col_1\", \"col_2\", \"col_3\")"


Now see how it displays when the escaped double quote chars are
interpreted properly using cat():

> cat(paste("c(\"", gsub(" ", "\", \"", a), "\")", sep = ""), "\n")
c("col_1", "col_2", "col_3") 


> Assuming that I did want to solve my original problem with gsub and then
> turn the string into an R object, how would I get gsub to return
> c("col_1", "col_2", "col_3") using my original string?

Again, note the two pass solution above.  It's easier, unless you would
want to consider using awk/sed from a CLI, which I generally avoid at
all costs...

> Finally, is there a way to declare a string as a regular expression so
> that R sees it the same way other languages, such as Perl do, i.e. make
> the backslash be interpreted the same way? For someone who is just
> learning regular expressions as I am, it is very frustrating to read
> about them in references and then have to translate what I've learned
> into R syntax. I was thinking that instead of enclosing the string in
> "", one could use THIS.IS.A.REGULAR.EXPRESSION(), similar to the way we
> use I() in formulae.

Part of the challenge is noting the different behaviors of regex within
R and how that behavior is affected by the aforementioned arguments.
Also, noting how the output is displayed within R relative to the
interpretation of escaped characters as is seen above.
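
A short sketch of the two points above, with one addition that postdates this thread: R 4.0.0 and later accept raw string literals of the form r"(...)", in which backslashes are not escapes, so a pattern can be typed as it appears in a regex reference. The sketch assumes R >= 4.0.0 for that line only; the variable names are illustrative.

```r
# The R string "\\s" holds two characters: a backslash and an s
n_chars <- nchar("\\s")
print(n_chars)   # [1] 2

# print() shows the escaped form, cat() shows the actual characters
print("\\s")     # [1] "\\s"
cat("\\s\n")     # \s

# Raw string literal (R >= 4.0.0): no doubling of backslashes needed
pat <- r"(\s+)"
same_pattern <- identical(pat, "\\s+")
print(same_pattern)   # [1] TRUE

# Both spellings behave identically when used as a pattern
parts <- strsplit("col_1 col_2\tcol_3", pat)[[1]]
print(parts)
```

On older versions of R the doubled backslash remains the only spelling, since the escaping is done by the R parser before the regex engine ever sees the string.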

> These are a bunch of questions, but obviously I have a lot to learn!
> 
> Thanks,
> 
> Mark


HTH,

Marc Schwartz



Re: [R] regexpr and parsing question

2007-01-30 Thread Gabor Grothendieck
And here is an alternative to the regular expressions (although again
I don't think you really need any of this):

> capture.output(dput(strsplit("col1 col2 col3", " ")[[1]]))
[1] "c(\"col1\", \"col2\", \"col3\")"
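
A brief sketch of why the text form is rarely needed (the data frame and variable names below are made up for illustration): strsplit() already returns a character vector that can be assigned as column names directly, and the dput() text only matters if the vector has to be pasted into code or evaluated later.

```r
header <- "col1 col2 col3"

# Split on runs of whitespace; [[1]] extracts the character vector
col_names <- strsplit(header, "\\s+")[[1]]

# Use the vector directly as column names of a hypothetical data frame
df <- data.frame(1:2, 3:4, 5:6)
names(df) <- col_names
print(df)

# The text form round-trips back to the same vector if it is ever needed
txt <- capture.output(dput(col_names))
round_trip <- eval(parse(text = txt))
print(identical(round_trip, col_names))
```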
