Re: [R] Filtering an Entire Dataset based on Several Conditions

2022-05-09 Thread Avi Gross via R-help
Paul,

I read through the public replies you received, and clearly some of us were not
sure what you were asking. Your subject line did not help: my first thought was
that you wanted a single column tested against two conditions, that is, values
less than 3 standard deviations above the mean AND less than 3 standard
deviations below it. As someone else noted, using abs(whatever) makes that easy
to do with a single condition.

You could have simply said you want to remove outliers more than 3 standard
deviations from the mean! I note that in standard normally distributed data the
values you want to exclude may be only about 0.3% of each column, but across 39
such columns quite a few rows will contain at least one of them.
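
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each column is exactly standard normal and that the 39 columns are independent, which real features rarely are, so treat the result as an order-of-magnitude guide only. The names p_out, n_cols and p_row are made up for this illustration.

```r
# Chance that a single standard-normal value falls outside +/- 3
p_out <- 2 * pnorm(-3)

# Chance that a row of 39 independent columns holds at least one such value
n_cols <- 39
p_row <- 1 - (1 - p_out)^n_cols

round(c(per_value = p_out, per_row = p_row), 4)
```

Under those assumptions roughly one row in ten would be flagged, even though each individual column loses well under 1% of its values.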

The voluminous data you shared makes it clear you have 39 columns of this. So 
it seems what you meant was you wanted to apply the same logic to all 39 
columns and maybe that is what you meant by multiple conditions.

But what result do you want? Do you want to remove only rows in which every
value is out of bounds, or any row with even a single value outside? Do you want
to remove the row or just mark the outlier, perhaps with an NA?
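
The three possible answers lead to different code, so here is a small sketch of each. The matrix z and its values are invented purely for illustration and stand in for already normalized data.

```r
# Toy z-score matrix; values are made up for illustration
z <- matrix(c( 0.5, -1.2,  2.9,
               3.4,  0.1, -0.7,
              -3.2,  4.1,  3.6),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("A", "B", "C")))

out <- abs(z) > 3                          # TRUE where a value is an outlier

any_out <- rowSums(out) > 0                # at least one outlier in the row
all_out <- rowSums(out) == ncol(z)         # every value in the row is an outlier

z_drop_any <- z[!any_out, , drop = FALSE]  # strictest: keeps row 1 only
z_drop_all <- z[!all_out, , drop = FALSE]  # lenient: keeps rows 1 and 2

z_marked <- z
z_marked[out] <- NA                        # keep every row, blank the outliers
```

Which of the three is right depends on what the downstream analysis can tolerate, so it is worth deciding before writing the filter.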

Some have suggested you convert to a matrix, where many tools are available, and
back to a data.frame if needed. I note your solution becomes fairly trivial if
you convert any value whose absolute value is above 3.0 to an NA and then use
complete.cases() to remove any rows containing an NA. This assumes, of course,
that you have no NA to start with.
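
If the data might already contain NA values that should be kept, one possible workaround is to count outliers per row rather than relying on complete.cases(). This is only a sketch on an invented data frame; mydf, n_out and kept are placeholder names.

```r
# Toy data frame with one genuine missing value and one outlier
mydf <- data.frame(A = c(0.2, NA, 1.1, -0.4),
                   B = c(1.5, 0.3, -3.8, 2.2))

# Count outliers per row, ignoring values that were already NA
n_out <- rowSums(abs(mydf) > 3, na.rm = TRUE)

# Rows without outliers are kept, including row 2 with its original NA
kept <- mydf[n_out == 0, , drop = FALSE]
```

Here only the third row is dropped, because it is the only one holding a value beyond 3 in absolute terms.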

There are all kinds of ways to do things and if you were using the dplyr 
package from the tidyverse, which we are discouraged from talking about here, I 
can see possibilities including the rowwise() function.

One idea to consider is that you can use the max() and min() functions applied
to rows or columns. You can work on your data as a matrix (transposing with t()
if needed) and remove any rows where the max of the absolute value of "all row
elements" exceeds 3.
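
A minimal sketch of that idea follows. It uses apply() with MARGIN = 1, which walks over rows directly, so no t() is needed in this variant. The matrix m is random stand-in data with one outlier planted by hand, and the seed is arbitrary.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
m <- matrix(rnorm(200), nrow = 50)           # stand-in for the scaled data
m[7, 2] <- 3.5                               # plant one known outlier

row_max <- apply(abs(m), 1, max)             # largest |z| in each row
m_clean <- m[row_max <= 3, , drop = FALSE]   # keep rows with no |z| above 3
```

The drop = FALSE keeps the result a matrix even if only one row survives.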

One thought is to re-scale your data again using a function that is a bit like 
a truncated normal distribution but instead of tossing outliers, it returns an 
NA. As noted, complete.cases() then handles your need.

But in your case, all your columns are numeric and of the same kind, so fairly
trivial code like:

mydf[abs(mydf) > 3] <- NA

mydf <- mydf[complete.cases(mydf), ]

just might do it for you.

Good luck!

-----Original Message-----
From: Paul Bernal 
To: Rui Barradas 
Cc: R 
Sent: Mon, May 9, 2022 12:44 pm
Subject: Re: [R] Filtering an Entire Dataset based on Several Conditions

Dear Rui,

I was trying to dput() the datasets I am working on, but since they are a bit
large (42,000 rows by 60 columns) I couldn't retrieve all the structure of
the data to include it here, so I am attaching a couple of files. One is
the raw data (called trainFeatures42k), which is the data I need to
normalize, and the other is Normalized_Data, which is the data normalized
(or at least I think I managed to normalize it).

 Normalized_Data.csv
<https://drive.google.com/file/d/143I1O710gAqWjzx48Gt1bwUbrG0mbpfa/view?usp=drive_web>
 trainFeatures42k.xls
<https://drive.google.com/file/d/1deMzGMkJyeVsnRzTKirmm4VqIBRzbvzV/view?usp=drive_web>

I have tried some of the code you and other friends from the community have
kindly shared, but have not been able to filter values > -3 and < 3.

Thank you all for your valuable help always.
Best,
Paul

On Mon, 9 May 2022 at 4:22, Rui Barradas wrote:

> Hello,
>
> Something like this?
> First normalize the data.
> Then an apply() loop creates a logical matrix giving which numbers are in
> the range -3 to 3.
> If they are all TRUE then their sum by rows is equal to the number of
> columns. This creates a logical index i.
> Use that index i to subset the scaled data set.
>
> # test data set, remove the Species column (not numeric)
> df1 <- iris[-5]
>
> df1_norm <- scale(df1)
> i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)
>
> # returns a matrix
> df1_norm[i, ]
>
> # returns a data.frame
> as.data.frame(df1_norm[i,])
>
>
> Hope this helps,
>
> Rui Barradas
>
> At 09:23 on 09/05/2022, Paul Bernal wrote:
> > Dear friends,
> >
> > I have a dataframe in which every single (i,j) entry (i standing for the
> > ith row, j for the jth column) has been normalized (converted to z-scores).
> >
> > Now I want to filter or subset the dataframe so that I only end up with a
> > dataframe containing only entries greater than -3 and less than 3.
> >
> > How could I accomplish this?
> >
> > Best,
> > Paul
> >
> >      [[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-pr

Re: [R] Filtering an Entire Dataset based on Several Conditions

2022-05-09 Thread Bert Gunter
This is trivial, so perhaps there is a miscommunication. How do you want to
handle values outside your desired range? I would simply change them to NA
(see below), but perhaps you have something else in mind that you need to
describe more explicitly. Anyway, below is a simple example of what I
*think* you asked for. Apologies if I have misunderstood.

> set.seed(567)
> ## create a data frame with 3 columns and 5 rows from norm(0,3)
> d <- as.data.frame(lapply(rep(5,3), function(x)round(rnorm(x,0,3),2)))
> names(d) <- LETTERS[1:3]
> d
      A     B     C
1  1.97 -1.23 -3.41
2  1.02 -1.12 -2.27
3 -1.92 -6.37 -6.44
4 -4.32  0.18  4.08
5  0.66 -5.82 -0.81
> d[abs(d) > 3] <- NA
> d
      A     B     C
1  1.97 -1.23    NA
2  1.02 -1.12 -2.27
3 -1.92    NA    NA
4    NA  0.18    NA
5  0.66    NA -0.81

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )





Re: [R] Filtering an Entire Dataset based on Several Conditions

2022-05-09 Thread Rui Barradas

Hello,

My code seems to work with your data, except that the first column is 
not to be scaled.




# file names
xlsfile <- file.path("~/dados", "trainFeatures42k.xls")
csvfile <- file.path("~/dados", "Normalized_Data.csv")
# read in the data files
df1 <- readxl::read_excel(xlsfile, col_names = FALSE)
df2 <- read.csv(csvfile)
# assign names to make all.equal happy
names(df1) <- sprintf("X%d", seq_len(ncol(df1)))
names(df2) <- sprintf("X%d", seq_len(ncol(df2)))

# the first column is not to be scaled
df1_norm <- scale(df1[-1])
# compare to the already scaled data from the Google Drive
# the data.frames are equal up to floating-point precision
identical(df2[-1], as.data.frame(df1_norm))
#[1] FALSE
all.equal(df2[-1], as.data.frame(df1_norm))
#[1] TRUE

# see if all values in each row are between -3 and 3
i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)

# filter and return a data.frame
df1_clean <- as.data.frame(df1_norm[i,])
dim(df1_clean)
#[1] 32494    60


See if you get the same results.

Hope this helps,

Rui Barradas




Re: [R] Filtering an Entire Dataset based on Several Conditions

2022-05-09 Thread Rui Barradas

Hello,

Something like this?
First normalize the data.
Then an apply() loop creates a logical matrix giving which numbers are in
the range -3 to 3.
If they are all TRUE then their sum by rows is equal to the number of 
columns. This creates a logical index i.

Use that index i to subset the scaled data set.

# test data set, remove the Species column (not numeric)
df1 <- iris[-5]

df1_norm <- scale(df1)
i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)

# returns a matrix
df1_norm[i, ]

# returns a data.frame
as.data.frame(df1_norm[i,])
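
As a possible alternative, the same index can be built without apply() by comparing the whole matrix at once. This is only a sketch for comparison with the apply() version above; i_apply and i_vec are names made up here, and function(x) is used in place of \(x) so it also runs on R versions older than 4.1.

```r
df1_norm <- scale(iris[-5])

# apply() version, as in the message above
i_apply <- rowSums(apply(df1_norm, 2, function(x) x > -3 & x < 3)) == ncol(df1_norm)

# whole-matrix comparison: a row is kept when none of its |z| reaches 3
i_vec <- rowSums(abs(df1_norm) >= 3) == 0

# both indices select the same rows
stopifnot(all(i_apply == i_vec))
```

Either index can then be used as df1_norm[i_vec, ] in the same way as i.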


Hope this helps,

Rui Barradas



Re: [R] Filtering an Entire Dataset based on Several Conditions

2022-05-09 Thread Jim Lemon
Hi Paul,
Based on my guess that all values have been normalized, I would say:

mat<-(matrix(runif(16,-5,5),4))
df<-as.data.frame(mat)
df[abs(df) < 3]<-NA
df
        V1       V2       V3        V4
1       NA 4.675699 3.166625        NA
2       NA       NA       NA  3.463660
3 4.288831       NA 4.032902 -4.343869
4       NA       NA       NA        NA

Beware that this behavior of data frames (assigning through a logical index
built from the whole data frame) may not persist in the future. You may have to
convert to a matrix and then convert the result back again.
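
The matrix round trip could look roughly like the sketch below. Note that abs(df) < 3 in the snippet above blanks the values inside the range; if the aim is to discard the outliers instead, the comparison would presumably go the other way, which is what this sketch assumes. The seed is arbitrary, and m, df_marked and df_clean are placeholder names.

```r
set.seed(42)                                   # arbitrary seed
df <- as.data.frame(matrix(runif(16, -5, 5), 4))

m <- as.matrix(df)        # explicit matrix, so the indexing is plain matrix code
m[abs(m) > 3] <- NA       # blank the outliers (note the comparison direction)
df_marked <- as.data.frame(m)

# optionally drop every row that now contains an NA
df_clean <- df_marked[complete.cases(df_marked), , drop = FALSE]
```

With only four rows of uniform data on (-5, 5), df_clean may well end up empty; the point is the shape of the code, not the toy result.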

Jim
