Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-08 Thread Andrew Choens
Interesting. Thanks.

On Sat, 2009-02-07 at 02:36 +0100, Wacek Kusnierczyk wrote:
 (this is actually off-topic, but since it may be interesting for the
 general public, i keep the response cc: to r-help)
 
 yes, you can do this with sed.  suppose you have two files, one (say,
 sample.txt) with the data to be filtered, record fields separated by,
 e.g., a tab character, and another (say, filter.txt) with patterns to be
 matched.  a row from the first is passed to output only if its second
 field does not match any of the patterns -- this corresponds to (a
 simplified version of) the original problem.
 
 then, the following should do:
 
 sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt > filtered-sample.txt
 
 (unless the patterns contain characters that interfere with the shell or
 sed's syntax, in which case they'd have to be appropriately escaped.)
 
 vQ
 
 
 
 
 
-- 
This is the price and the promise of citizenship.
-- Barack Obama, 44th President of the United States

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Wacek Kusnierczyk
Laura Rodriguez Murillo wrote:
 Hi,

 I'm new to the mailing list, but I would appreciate it if you could help
 me with this:
 I have a big matrix from which I need to delete specific rows. The
 second entry on each row to delete should match any string within a
 list (another file with just one column).
 Thank you so much!

   

here's one way to do it, illustrated with dummy data:

# dummy character matrix: 10 rows, 2 columns of random 20-letter strings
data = matrix(replicate(20, paste(sample(letters, 20), collapse='')),
              ncol=2)

# drop rows whose second column matches 'a'
data[-grep('a', data[,2]),]

this will work also if your data is actually a data frame:

data = as.data.frame(data)
data[-grep('a', data[,2]),]

note, because grep returns an empty index vector when nothing matches,
this won't work correctly if there are *no* rows that match the pattern:

data[-grep('1', data[,2]),]
# should return all of data, but returns an empty matrix

with the upcoming version of r, grep will have an additional argument
which will make this problem easy to fix:

data[grep('a', data[,2], invert=TRUE),]
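
A minimal sketch of one way to sidestep the empty-result problem without the invert argument: index with a logical vector from grepl instead of negating the positions returned by grep. It assumes grepl is available in the installed version of R; the seed and the names keep, filtered and all_rows are illustrative only and do not come from the thread.

```r
set.seed(1)  # assumption: fixed seed only so the dummy data is reproducible
data <- matrix(replicate(20, paste(sample(letters, 20), collapse = "")),
               ncol = 2)

# logical index: TRUE for rows whose second column does not contain 'a'
keep <- !grepl("a", data[, 2])
filtered <- data[keep, , drop = FALSE]

# no row contains the digit '1', so the logical index is all TRUE
# and every row is kept
all_rows <- data[!grepl("1", data[, 2]), , drop = FALSE]
```

Here drop = FALSE keeps the result a matrix even when only a single row survives the filter.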


vQ



Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Laura Rodriguez Murillo
Thank you. I think grep would do it, but the list of expressions I
need to match is too long so they are stored in a file. So the
question would be how I can tell R to look into that file to look for
the expressions that I want to match.

Thank you again for your help

Laura



Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Wacek Kusnierczyk
Laura Rodriguez Murillo wrote:
 Thank you. I think grep would do it, but the list of expressions I
 need to match is too long so they are stored in a file. 

what does 'too long' mean?

 So the
 question would be how I can tell R to look into that file to look for
 the expressions that I want to match.
   

i guess you may still successfully use r for this, but to me it sounds
like a perfect job for perl.  let me know if you need more help. 

note, in the code from my previous message, you'd use 'data[,2]' instead
of 'd[,2]' (or 'd' instead of 'data').  sorry for the typo.  mark, thanks
for pointing this out -- the more obvious the mistake, the less visible ;)

vQ




Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Laura Rodriguez Murillo
yep, it definitely sounds like a job for perl, but I don't know perl
(unfortunately). I'm still stuck with this so I'm giving more details
in case it helps:

I have file A with 382 columns and 30 rows. There are rows where
only the entry in the first column is duplicated in other rows. In these
cases, I need to delete the entire row.

I also have a file B (one column and around 28 rows) with a list
of the entries that are repeated. So I was trying to look for the ones
that match and get rid of the entire row.
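
A minimal sketch of how this could be done in R, assuming both files are tab-separated text without a header line and that entries must match exactly (no regular expressions). The temporary files and the names file_a, file_b, a, drop_ids and kept are hypothetical stand-ins for file A and file B, created only so the example runs on its own.

```r
# Hypothetical stand-ins for file A (data) and file B (entries to drop),
# written to temporary files so the sketch is self-contained.
file_a <- tempfile(fileext = ".txt")
file_b <- tempfile(fileext = ".txt")
writeLines(c("id1\t0\t1", "id2\t1\t1", "id3\t2\t0", "id2\t0\t2"), file_a)
writeLines("id2", file_b)

a <- read.table(file_a, sep = "\t", stringsAsFactors = FALSE)
drop_ids <- readLines(file_b)

# keep rows whose first column is NOT listed in file B (exact match)
kept <- a[!(a[[1]] %in% drop_ids), , drop = FALSE]

# without file B: the duplicated first-column entries can be found directly
dup_ids <- unique(a[[1]][duplicated(a[[1]])])
```

Because %in% compares whole strings, entries in file B that contain regular-expression metacharacters need no escaping.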

Thank you!

Laura



Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Andrew Choens
I regularly deal with a similar pattern at work. People send me these
big long .csv files and I have to run them through some pattern analysis
to decide which rows I keep and which rows I kill off.

As others have mentioned, Perl is a good candidate for this task.
Another option would be a quick SQL query. It should be a snap to pull
this into something like Access or OOo Base . . . . or better yet,  a
real database like Postgres, MySQL, etc.

In case you aren't too familiar with SQL, this query could be done by
deleting the rows using a self join (syntax varies by product).

But, if the pattern is as simple as it sounds and/or this is a
one-time job, using SQL is overkill for the situation.

I often use sed in places where Perl is overkill, but I can't think of
any way to match from row to row with sed. If anyone knows how to do
this with sed, it would (probably) be easier than trying to learn how to
use Perl. And, I would like to know how to do this with sed too.


On Fri, 2009-02-06 at 16:04 -0500, Laura Rodriguez Murillo wrote:
 yep, it definitely sounds like a work for perl, but I don't know perl
 (unfortunately). I'm still stuck with this so I'm giving more details
 in case it helps:
 
 I have file A with 382 columns and 30 rows. There are rows where
 only the entry in first column is duplicated in other rows. In these
 cases, I need to delete the entire row.
 
 I also have a file B (one column and around 28 rows) with a list
 of the entries that are repeated. So I was trying to look for the ones
 that match and get rid of the entire row.
 
 Thank you!
 
 Laura
 



Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Sebastien Bihorel

Hi Laura,

You might want to read the "R Data Import/Export" manual on 
the CRAN webpage http://cran.r-project.org/

Otherwise, have a look at ?read.table.

Sebastien



Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Laura Rodriguez Murillo
Thank you so much! I finally got it.

Laura



Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list

2009-02-06 Thread Wacek Kusnierczyk
Andrew Choens wrote:
 I often use sed in places where Perl is over-kill, but I can't think of
 any way to match from row to row with sed. If anyone knows how to do
 this with sed, it would (probably) be easier than trying to learn how to
 use perl. And, I would like to know how to do this with sed too.

   

(this is actually off-topic, but since it may be interesting for the
general public, i keep the response cc: to r-help)

yes, you can do this with sed.  suppose you have two files, one (say,
sample.txt) with the data to be filtered, record fields separated by,
e.g., a tab character, and another (say, filter.txt) with patterns to be
matched.  a row from the first is passed to output only if its second
field does not match any of the patterns -- this corresponds to (a
simplified version of) the original problem.

then, the following should do:

sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt > filtered-sample.txt

(unless the patterns contain characters that interfere with the shell or
sed's syntax, in which case they'd have to be appropriately escaped.)
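
For comparison, a rough sketch of the same filter in R, assuming tab-separated data and a pattern file with one regular expression per line. The temporary files and the names sample_file, filter_file, rows, patterns, hit and filtered are illustrative and not from the thread. Unlike the sed command, which anchors each pattern at the start of the second field, grepl matches anywhere in the field unless the pattern itself is anchored.

```r
# Assumption: both inputs are written to temporary files here so the
# sketch is self-contained; real data would be read from existing files.
sample_file <- tempfile(fileext = ".txt")
filter_file <- tempfile(fileext = ".txt")
writeLines(c("r1\tapple\tx", "r2\tbanana\ty", "r3\tcherry\tz"), sample_file)
writeLines(c("^app", "rry$"), filter_file)

rows <- read.table(sample_file, sep = "\t", stringsAsFactors = FALSE)
patterns <- readLines(filter_file)

# TRUE for every row whose second field matches at least one pattern;
# the all-FALSE initial value covers the case of an empty pattern file
hit <- Reduce(`|`, lapply(patterns, grepl, x = rows[[2]]),
              rep(FALSE, nrow(rows)))

# keep only the rows that matched none of the patterns
filtered <- rows[!hit, , drop = FALSE]
```

Testing one pattern at a time avoids building a single very long alternation when the pattern file is large.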

vQ
