[R] How to pass variable column name into R function
I know this has been explained a few times here in different scenarios, but I am having a hard time digesting this. The following code works fine as long as it's not inside a function (see below).

    df$season <- as.character(df$season)
    temp <- model.matrix( ~ season - 1, data=df)
    df <- cbind(df,temp)

BEFORE:

    head(df[c(1,2)])
        datetime season
    1 2011-01-01      1
    2 2011-01-01      1
    3 2011-01-01      1
    4 2011-01-01      1
    5 2011-01-01      1
    6 2011-01-01      1

AFTER:

    head(df[c(1,2,13:16)])
        datetime season season1 season2 season3 season4
    1 2011-01-01      1       1       0       0       0
    2 2011-01-01      1       1       0       0       0
    3 2011-01-01      1       1       0       0       0
    4 2011-01-01      1       1       0       0       0
    5 2011-01-01      1       1       0       0       0
    6 2011-01-01      1       1       0       0       0

However, when I try to wrap it in a multi-use function:

    binarize <- function(data, myvar) {
      data$myvar <- as.character(data$myvar)
      temp <- model.matrix( ~ myvar - 1, data=data)
      data <- cbind(data,temp)
    }

it throws an error, undoubtedly because it cannot evaluate myvar or data (or both?):

    Error in `$<-.data.frame`(`*tmp*`, myvar, value = character(0)) :
      replacement has 0 rows, data has 10886

I've tried experimenting with eval(substitute()) but it's still not working. My ideal end-state is that you start with a dataframe and a variable, have the function map all of the values for the selected variable into separate binary columns, and append those to the original dataframe. Again, when it's not in a function it works perfectly.

Here's the dput data if it helps to reproduce.
dput(head(df,50))
structure(list(datetime = structure(c(14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14977, 14977, 14977), class = "Date"),
season = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
holiday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
workingday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L),
weather = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
temp = c(9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.2, 9.84, 13.12, 15.58, 14.76, 17.22, 18.86, 18.86, 18.04, 17.22, 18.04, 17.22, 17.22, 16.4, 16.4, 16.4, 18.86, 18.86, 18.04, 17.22, 18.86, 18.86, 17.22, 16.4, 16.4, 15.58, 14.76, 14.76, 14.76, 14.76, 14.76, 13.94, 13.94, 13.94, 14.76, 13.12, 12.3, 10.66, 9.84, 9.02, 9.02, 8.2, 6.56),
atemp = c(14.395, 13.635, 13.635, 14.395, 14.395, 12.88, 13.635, 12.88, 14.395, 17.425, 19.695, 16.665, 21.21, 22.725, 22.725, 21.97, 21.21, 21.97, 21.21, 21.21, 20.455, 20.455, 20.455, 22.725, 22.725, 21.97, 21.21, 22.725, 22.725, 21.21, 20.455, 20.455, 19.695, 17.425, 16.665, 16.665, 17.425, 17.425, 16.665, 16.665, 16.665, 16.665, 14.395, 13.635, 11.365, 10.605, 11.365, 9.85, 8.335, 6.82),
humidity = c(81L, 80L, 80L, 75L, 75L, 75L, 80L, 86L, 75L, 76L, 76L, 81L, 77L, 72L, 72L, 77L, 82L, 82L, 88L, 88L, 87L, 87L, 94L, 88L, 88L, 94L, 100L, 94L, 94L, 77L, 76L, 71L, 76L, 81L, 71L, 66L, 66L, 76L, 81L, 71L, 57L, 46L, 42L, 39L, 44L, 44L, 47L, 44L, 44L, 47L),
windspeed = c(0, 0, 0, 0, 0, 6.0032, 0, 0, 0, 0, 16.9979, 19.0012, 19.0012, 19.9995, 19.0012, 19.9995, 19.9995, 19.0012, 16.9979, 16.9979, 16.9979, 12.998, 15.0013, 19.9995, 19.9995, 16.9979, 19.0012, 12.998, 12.998, 19.9995, 12.998, 15.0013, 15.0013, 15.0013, 16.9979, 19.9995, 8.9981, 12.998, 11.0014, 11.0014, 12.998, 22.0028, 30.0026, 23.9994, 22.0028, 19.9995, 11.0014, 23.9994, 27.9993, 26.0027),
casual = c(3L, 8L, 5L, 3L, 0L, 0L, 2L, 1L, 1L, 8L, 12L, 26L, 29L, 47L, 35L, 40L, 41L, 15L, 9L, 6L, 11L, 3L, 11L, 15L, 4L, 1L, 1L, 2L, 2L, 0L, 0L, 0L, 1L, 7L, 16L, 20L, 11L, 4L, 19L, 9L, 7L, 10L, 1L, 5L, 11L, 0L, 0L, 0L, 0L, 0L),
registered = c(13L, 32L, 27L, 10L, 1L, 1L, 0L, 2L, 7L, 6L, 24L, 30L, 55L, 47L, 71L, 70L, 52L, 52L, 26L, 31L, 25L, 31L, 17L, 24L, 13L, 16L, 8L, 4L, 1L, 2L, 1L, 8L, 19L, 46L, 54L, 73L, 64L, 55L, 55L, 67L, 58L, 43L, 29L, 17L, 20L, 9L, 8L, 5L, 2L, 1L),
count = c(16L, 40L, 32L, 13L, 1L, 1L, 2L, 3L, 8L, 14L, 36L, 56L, 84L, 94L, 106L, 110L, 93L, 67L, 35L, 37L, 36L, 34L, 28L, 39L, 17L, 17L, 9L, 6L, 3L, 2L, 1L, 8L, 20L, 53L, 70L, 93L, 75L, 59L, 74L, 76L, 65L, 53L, 30L, 22L, 31L, 9L, 8L, 5L, 2L, 1L)),
.Names = c("datetime", "season", "holiday", "workingday", "weather", "temp", "atemp",
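A minimal sketch of one way the function in the question could be made to work, assuming the column name is passed as a character string (for example "season") rather than as a bare name. `data$myvar` looks for a column literally named myvar, while `data[[myvar]]` uses the value held in the argument, and `reformulate()` builds the formula from that string. The data frame `toy` is made up for illustration and is not the poster's data.

```r
# Sketch only: myvar is assumed to be a character string naming a column.
binarize <- function(data, myvar) {
  # [[ ]] looks up the column whose name is stored in myvar
  data[[myvar]] <- as.character(data[[myvar]])
  # reformulate("season", intercept = FALSE) yields the formula ~season - 1
  f <- reformulate(myvar, intercept = FALSE)
  dummies <- model.matrix(f, data = data)
  # return the original columns with the indicator columns appended
  cbind(data, dummies)
}

# Hypothetical toy data with more than one season level
toy <- data.frame(id = 1:5, season = c(1L, 2L, 3L, 4L, 1L))
out <- binarize(toy, "season")
names(out)
```

The function returns the combined data frame instead of assigning it to a local variable, so the caller would write something like `df <- binarize(df, "season")`.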
[R] Updating a data frame based on if condition
I have a subset of data that I have identified as suspect (for example, the first name has excessive spaces, is longer than 35 characters or has a number). What I want to do is update the FNAME_SUSPECT field in mydata to TRUE if any of those conditions are met. Here's my data: dput(mydata) structure(list(PERSON_FIRST_NAME = c(1298530, JULIA, TAYLOR, CS AND JEFF, 88, 4465891170098562, 1124211, LEWIS MARY KAY, KARL R O S, 5466181820076010, JULI0 C, WAYNE T., 1124211, 1124211, ROBERT B VIONA D, DENNIS and MARY SUE, BRIAN JOANNE, 1124211, RONALD and GAIL, Mike and Mary Lou, 31763006, 7, 11460735, Paul and Mary Beth, JIMMY and RUTH MARIE, 1124211, WAYNE LU ANN, SCOTT ANNA MARIE, 1124211, 1124211, 952714, DAVID, RHONDA and NATALIE, VIRGINIA S, 707069, 4397836190001917, MARIA DE LA LUZ, MARIA DE LA LUZ, G S COMPUTERIZED GRADING, 1124211, 1124211, 1124211, 1124211, MARIA DE LA LUZ, ED AND JANICE KISHI, 1124211, Garrett A. and Jenny E., 1124211, 1124211, Hiram T. and A. Judith, MA DE LA LUZ, STEVE, Bev, and Caleb, MR AND MRS EVER), FNAME_SUSPECT = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), FNAME_LENGTH = c(7L, 26L, 2L, 16L, 7L, 16L, 10L, 16L, 7L, 10L, 7L, 7L, 18L, 19L, 14L, 7L, 16L, 17L, 8L, 1L, 8L, 18L, 20L, 7L, 14L, 18L, 7L, 7L, 6L, 25L, 12L, 6L, 16L, 15L, 15L, 26L, 7L, 7L, 7L, 7L, 15L, 19L, 7L, 23L, 7L, 7L, 22L, 12L, 21L, 15L), FNAME_PATTERN = c(999, A,_AA,_AA_AAA_, 99, , 999, A___AAA, _A_A_A, , 9_A, A___A., 999, 999, AA_A__A_A, AA_AAA__AAA, A___AA, 999, AA_AAA__, _AAA__AAA, , 9, , _AAA__, A_AAA__A, 999, A__AA_AAA, A___A, 999, 999, 99, A,_AA_AAA_AAA, ___A, 99, , A_AA_AA_AAA, A_AA_AA_AAA, A__A__AAA, 999, 999, 999, 999, A_AA_AA_AAA, AA_AAA_AA_A, 999, AAA_A._AAA_A_A., 
999, 999, A_A._AAA_A._AA, AA_AA_AA_AAA, A,_AAA,_AAA_A, AA_AAA_AAA_ ), FNAME_TOKEN_COUNT = c(1L, 5L, 1L, 1L, 1L, 4L, 4L, 1L, 2L, 4L, 1L, 1L, 5L, 4L, 4L, 1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 1L, 1L, 4L, 4L, 5L, 1L, 1L, 1L, 1L, 4L, 4L, 1L, 5L, 1L, 1L, 5L, 4L, 4L, 4L)), .Names = c(PERSON_FIRST_NAME, FNAME_SUSPECT, FNAME_LENGTH, FNAME_PATTERN, FNAME_TOKEN_COUNT ), row.names = c(6717L, 11035L, 11626L, 14965L, 17874L, 24341L, 25582L, 25834L, 26851L, 30134L, 36385L, 45244L, 46947L, 61449L, 67564L, 71465L, 73782L, 75278L, 78977L, 79037L, 80577L, 81644L, 84427L, 86286L, 89963L, 91208L, 94054L, 99518L, 114658L, 128305L, 129082L, 137492L, 137573L, 138556L, 139489L, 148757L, 153956L, 155546L, 160533L, 162386L, 162681L, 165220L, 168063L, 173003L, 175322L, 179935L, 180991L, 181215L, 183787L, 184573L), class = data.frame)

Note I defaulted all of the FNAME_SUSPECT to FALSE. I plan to change that later.

I've tried running this:

    if (mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 | regexpr("9", mydata$FNAME_PATTERN) > 0) mydata$FNAME_SUSPECT <- TRUE

however I get the warning:

    Warning message:
    In if (mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 |  :
      the condition has length > 1 and only the first element will be used

Would I be better off doing this in a for loop? I had once heard that if you're doing a for loop in R, you're doing something wrong.

--
Jeff

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
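The warning arises because `if` looks at a single TRUE/FALSE, while comparisons on whole columns return one value per row. A hedged sketch of the vectorised alternative: assign the combined logical vector directly, with no `if` and no loop. The small data frame below is invented for illustration, and `grepl()` is swapped in for the `regexpr(...) > 0` test; treat both as assumptions rather than the poster's code.

```r
# Hypothetical toy rows standing in for the poster's mydata
mydata <- data.frame(
  FNAME_TOKEN_COUNT = c(1L, 5L, 1L, 4L, 1L),
  FNAME_LENGTH      = c(7L, 26L, 40L, 16L, 5L),
  FNAME_PATTERN     = c("999", "A,_AA", "AAA", "A___AAA", "AAAAA"),
  stringsAsFactors  = FALSE
)

# Each comparison yields one logical per row; | combines them row by row
mydata$FNAME_SUSPECT <- mydata$FNAME_TOKEN_COUNT > 3 |
  mydata$FNAME_LENGTH > 35 |
  grepl("9", mydata$FNAME_PATTERN, fixed = TRUE)

mydata$FNAME_SUSPECT
```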
Re: [R] Updating a data frame based on if condition
This is my first time with ifelse, but I've tried:

    mydata$FNAME_SUSPECT <- ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE, FALSE,
                            ifelse(mydata$FNAME_LENGTH > 35, TRUE, FALSE,
                            ifelse(regexpr("9", mydata$FNAME_PATTERN) > 0, TRUE, FALSE)))

    Error in ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE, FALSE, ifelse(mydata$FNAME_LENGTH >  :
      unused argument (ifelse(mydata$FNAME_LENGTH > 35, TRUE, FALSE, ifelse(regexpr("9", mydata$FNAME_PATTERN) > 0, TRUE, FALSE)))

I have the R for Dummies book which covers it a bit, but I just ordered the R Cookbook.

On Tue, Feb 18, 2014 at 10:16 AM, David Carlson dcarl...@tamu.edu wrote:
> Not always true, but it is in this case:
>
> ?ifelse
>
> David C
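The "unused argument" error comes from `ifelse()` accepting exactly three arguments: `test`, `yes` and `no`. A small sketch of the nesting, using an invented vector `x`: the inner `ifelse()` has to take the place of the `no` value instead of being added as a fourth argument.

```r
# x is a made-up vector for illustration
x <- c(1L, 5L, 2L)

# ifelse(test, yes, no): the nested call sits in the 'no' position
flag <- ifelse(x > 3, TRUE,
               ifelse(x > 1, TRUE, FALSE))
flag
```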
Re: [R] Updating a data frame based on if condition
Ahh, I was specifying the second argument FALSE incorrectly. Works now as:

    mydata$FNAME_SUSPECT <- ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE,
                            ifelse(mydata$FNAME_LENGTH > 55, TRUE,
                            ifelse(regexpr("9", mydata$FNAME_PATTERN) == 0, TRUE, FALSE)))
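One thing worth checking in the expression above: `regexpr()` returns the position of the first match (1 or more) or -1 when there is no match, so a comparison with `== 0` is never TRUE. A short sketch on a made-up `patterns` vector, with `grepl()` shown as one possible replacement if the intent is "contains a 9"; that intent is an assumption based on the original post.

```r
# Hypothetical pattern strings in the style of FNAME_PATTERN
patterns <- c("999", "A___AAA", "9_A", "")

pos <- as.integer(regexpr("9", patterns))  # match position, or -1 for no match
any_zero <- any(pos == 0)                  # FALSE: 0 never occurs

# grepl() returns TRUE/FALSE directly
has_digit <- grepl("9", patterns, fixed = TRUE)
```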
Re: [R] Updating a data frame based on if condition
Thanks David, that's a great improvement.

On Tue, Feb 18, 2014 at 12:36 PM, David Carlson dcarl...@tamu.edu wrote:
> What you have can work, but it will be hard to maintain and debug. Easier to follow is
>
>     cond1 <- mydata$FNAME_TOKEN_COUNT > 3
>     cond2 <- mydata$FNAME_LENGTH > 55
>     cond3 <- regexpr("9", mydata$FNAME_PATTERN) == 0
>     mydata$FNAME_SUSPECT <- apply(cbind(cond1, cond2, cond3), 1, any)
>     mydata$FNAME_SUSPECT
>      [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
>     [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
>     [25]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
>     [37] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>     [49]  TRUE  TRUE
>
> And adding or changing a condition is pretty simple.
>
> David C
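As a possible variation on the approach with named conditions, the row-wise "any" can also be written with the vectorised `|` operator, or with `Reduce()` when the conditions are kept in a list. This is a sketch on invented logical vectors, not a claim about which form was used in the thread.

```r
# Made-up conditions standing in for cond1, cond2, cond3
cond1 <- c(FALSE, TRUE, FALSE)
cond2 <- c(FALSE, FALSE, TRUE)
cond3 <- c(FALSE, FALSE, FALSE)

by_apply  <- apply(cbind(cond1, cond2, cond3), 1, any)  # row-wise any()
by_or     <- cond1 | cond2 | cond3                      # element-wise OR
by_reduce <- Reduce(`|`, list(cond1, cond2, cond3))     # same, for a list
```

All three give the same logical vector here; the `|` form avoids building a matrix, while the list form makes it easy to add or drop a condition.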
[R] In RStudio/Win7, which directory stores the markdown.css file?
I'm running Windows 7 and RStudio .98.490. I need to edit the CSS file to test something out, but I've found multiple files and changing them seems to do nothing. It almost seems like the CSS may be cached or something since all changes to it do nothing. FYI, I've read the tutorials on custom CSS as well.

Thanks in advance.

--
Jeff
Re: [R] Controlling font size on code chunk outputs using Knitr
Yihui/Jeff,

I'm trying to determine where the default CSS file is located, as I don't see this in any of the documentation. I can definitely find a markdown.css file in C:\Program Files\RStudio\resources, and I also see an R.css file in that directory. I also have R.css in C:\Users\jeffjohn\Dropbox\R\Rlibs\rstudio\html, which is where I have all of my packages installed.

I've tried making a simple change to each of them (first backing them up, of course) by changing the h1 to small instead of x-large and saving the file, but when I knit the document nothing changes. Would you know how I can determine which CSS file a given .Rmd file is referencing?

Any guidance you can provide would be extremely helpful. Again, I'm using RStudio on Windows.
Re: [R] Controlling font size on code chunk outputs using Knitr
Hi Yihui,

The package I have installed is knitr. To generate the HTML, I run Knit HTML from within RStudio version .98.490 (there's an icon to initiate it).

Here's a simple example:

    showcode <- FALSE
    commentchar <- NA

You can load this data as 'mydf'...

dput(mydf)
structure(list(PERSONPROFILE_POS = DV, PARTY_ID = 95252415L, PERSON_FIRST_NAME = Julie, PERSON_LAST_NAME = herlastname, PERSON_MIDDLE_NAME = NA_character_, PARTY_NUMBER = 49229698L, ACCOUNT_NUMBER = 104205066L, ABILITEC_LINK = 25455695, ADDRESS1 = 332 SE SOME RD, ADDRESS2 = NA_character_, ADDRESS3 = NA_character_, ADDRESS4 = NA_character_, CITY = SOMECITY, COUNTY = SOMECOUNTY, STATE = OR, PROVINCE = NA_character_, POSTAL_CODE = 97111-, COUNTRY = US, PRIMARY_PER_TYPE = N, SELLTOADDR_LOS = DV, LOCATION_ID = 6222438L, SELLTOADDR_SOS = DV, PARTY_SITE_ID = 7292226L, PRIMARYPHONE_CPOS = DV, CONTACT_POINT_ID_PCP = 62243903L, CONTACT_POINT_PURPOSE_PCP = PERSONAL, PHONE_LINE_TYPE = GEN, PRIMARY_FLAG_PCP = Y, PHONE_COUNTRY_CODE = NA_integer_, PHONE_AREA_CODE = 244, PHONE_NUMBER = 244, EMAIL_CPOS = DV, CONTACT_POINT_ID_ECP = 6202L, CONTACT_POINT_PURPOSE_ECP = NA_character_, PRIMARY_FLAG_ECP = Y, EMAIL_ADDRESS = someem...@yahoo.com, BB_PARTY_ID = NA, VALID_COUNTRY = TRUE, VALID_USSTATE = TRUE, POSTAL_PATTERN = 9-, VALID_USPP = TRUE, FULL_PHONE = 244244, FULL_PHONE_PATTERN = NA99, FNAME_PATTERN = A, FNAME_LENGTH = 5L, FNAME_TOKEN_COUNT = 1L, LNAME_LENGTH = 4L, LNAME_PATTERN = , MNAME_LENGTH = 2L, MNAME_PATTERN = NA_character_, MNAME_TOKEN_COUNT = 1L, LNAME_TOKEN_COUNT = 1L, EMAIL_LENGTH = 19L, VALID_EMAIL = TRUE), .Names = c(PERSONPROFILE_POS, PARTY_ID, PERSON_FIRST_NAME, PERSON_LAST_NAME, PERSON_MIDDLE_NAME, PARTY_NUMBER, ACCOUNT_NUMBER, ABILITEC_LINK, ADDRESS1, ADDRESS2, ADDRESS3, ADDRESS4, CITY, COUNTY, STATE, PROVINCE, POSTAL_CODE, COUNTRY, PRIMARY_PER_TYPE, SELLTOADDR_LOS, LOCATION_ID, SELLTOADDR_SOS, PARTY_SITE_ID, PRIMARYPHONE_CPOS, CONTACT_POINT_ID_PCP, CONTACT_POINT_PURPOSE_PCP, PHONE_LINE_TYPE, PRIMARY_FLAG_PCP, PHONE_COUNTRY_CODE, PHONE_AREA_CODE, PHONE_NUMBER, EMAIL_CPOS, CONTACT_POINT_ID_ECP, CONTACT_POINT_PURPOSE_ECP, PRIMARY_FLAG_ECP, EMAIL_ADDRESS, BB_PARTY_ID, VALID_COUNTRY, VALID_USSTATE, POSTAL_PATTERN, VALID_USPP, FULL_PHONE, FULL_PHONE_PATTERN, FNAME_PATTERN, FNAME_LENGTH, FNAME_TOKEN_COUNT, LNAME_LENGTH, LNAME_PATTERN, MNAME_LENGTH, MNAME_PATTERN, MNAME_TOKEN_COUNT, LNAME_TOKEN_COUNT, EMAIL_LENGTH, VALID_EMAIL ), row.names = 1L, class = data.frame)

You can load that dataset, then:

Print the column names

```{r, echo=showcode, comment=commentchar}
colnames(mydf)
```

The resulting font is a couple of points larger than I'd like. I'd like to be able to control this either globally or at the code chunk level.

Thanks for your help with this!

On Wed, Jan 29, 2014 at 5:57 PM, Yihui Xie x...@yihui.name wrote:
> Please provide a minimal example -- are you using R Markdown or R HTML? Both can produce HTML output: http://yihui.name/knitr/demo/minimal/
>
> Regards,
> Yihui
> --
> Yihui Xie xieyi...@gmail.com
> Web: http://yihui.name
Re: [R] Controlling font size on code chunk outputs using Knitr
Thanks Yihui and Jeff. I've retrieved the default CSS file and made a tweak to it (changing a header 1 size just to test it) and saved it to the same local directory as my .Rmd file using the name 'mymarkdown.css' for testing.

I've added:

    options(rstudio.markdownToHTML =
      function(inputFile, outputFile) {
        require(markdown)
        markdownToHTML(inputFile, outputFile, stylesheet='mymarkdown.css')
      }
    )

to the top of my testfile.Rmd file so that my file now looks like:

    options(rstudio.markdownToHTML =
      function(inputFile, outputFile) {
        require(markdown)
        markdownToHTML(inputFile, outputFile, stylesheet='mymarkdown.css')
      }
    )

    Title

    This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the **Help** toolbar button for more details on using R Markdown).

    When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

    ```{r}
    summary(cars)
    ```

But when I knit it, it just writes the options chunk at the top of my document. Am I supposed to add something else to get the .Rmd file to reference the CSS? I'm quite new to programming and R (as if you couldn't tell!), so I'm not sure what additional steps I need to add.

Thanks much.
Jeff

On Thu, Jan 30, 2014 at 1:48 PM, Yihui Xie x...@yihui.name wrote:
> Exactly. Please see RStudio documentation: https://support.rstudio.com/hc/en-us/articles/200552186-Customizing-Markdown-Rendering
>
> Regards,
> Yihui
> --
> Yihui Xie xieyi...@gmail.com
> Web: http://yihui.name
>
> On Thu, Jan 30, 2014 at 10:57 AM, Jeff Newmiller jdnew...@dcn.davis.ca.us wrote:
>> This sounds like a classic "you need to write a custom CSS file" problem... Which is off-topic here, so is homework for you.
[R] Controlling font size on code chunk outputs using Knitr
Hi there,

I'm currently using knitr to generate an HTML file, but the output of my code is in a font size that's larger than I want. I've been looking through various options for controlling the font size of the code results, such as the knitr manual, opts_chunk, and LaTeX.

The code itself is not being output, as desired (I set echo=FALSE intentionally). However, I wish to make the results of executing the code a couple of font sizes smaller. I'll likely want all code output chunks to be smaller, so a global setting is fine, though I would also appreciate understanding how to control it at the chunk level.

Does anyone have a recommendation on how to do this? There is lots of discussion on Google, but I don't see any tangible results. I'm still pretty new to R, however. Thanks in advance.
-- Jeff
Re: [R] Controlling font size on code chunk outputs using Knitr
Thank you Yihui for responding. I'll reply with details when I get in the office tomorrow am. I'm using Rstudio and added the knitr package if that helps. I'll check details and provide an example tomorrow am. I appreciate your help.

Sent from my iPhone

On Jan 29, 2014, at 5:57 PM, Yihui Xie x...@yihui.name wrote:

Please provide a minimal example -- are you using R Markdown or R HTML? Both can produce HTML output: http://yihui.name/knitr/demo/minimal/

Regards, Yihui
-- Yihui Xie xieyi...@gmail.com Web: http://yihui.name
[R] KnitR/RMarkdown: Is there a way to not print a section of the document?
I've been looking through the R documents to see if there's a way to not output certain chunks of code. I'm trying to present a document to a team of folks that won't necessarily be interested in the line-by-line code, though they are interested in the charts, etc. Thus, I'd like to not output certain chunks of code. Is there a way to suppress sections? Thank you.
-- Jeff
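For readers with the same question, the usual route is knitr's chunk options. The sketch below is not taken from the thread; it assumes knitr is installed and that the code sits in a setup chunk at the top of the .Rmd file.

```r
# Sketch for a setup chunk at the top of an .Rmd file (assumes knitr is installed).
library(knitr)

# Hide the source code of every chunk but keep its results and plots.
opts_chunk$set(echo = FALSE)

# Per-chunk overrides go in the chunk header, for example:
#   {r, echo=FALSE}     code hidden, output and plots shown
#   {r, include=FALSE}  chunk runs, but neither code nor output appears
#   {r, eval=FALSE}     code shown but not run
```

A global default plus an occasional per-chunk override keeps the report free of code while still letting a single chunk be shown when the code itself is the point.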
[R] subset and na.rm not really suppressing NA values
I have a dataset mydf with a field EMAIL_ADDRESS. When importing, I specified:

    mydf <- read.csv(file = extract, header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", ""))

I've also tried setting na.strings = c("NA", "", NA) but I don't know if it's appropriate to put NA there. I'm running:

    a <- subset(mydf, VALID_EMAIL == FALSE, na.rm = TRUE, select = EMAIL_ADDRESS)
    dput(head(a, 5))
    structure(list(EMAIL_ADDRESS = c(NA_character_, NA_character_, NA_character_, NA_character_, NA_character_)), .Names = "EMAIL_ADDRESS", row.names = c(17L, 22L, 23L, 24L, 30L), class = "data.frame")

The results show a lot of NA values on screen and in the dput output. I don't quite understand why it is doing that. I would have expected it to exclude those since I had the na.rm = TRUE argument. Do you have any suggestions? Thanks!
-- Jeff
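A note on the behaviour described above: subset() for data frames has no na.rm argument, so the extra argument is accepted and ignored. A minimal sketch of one way to drop the missing addresses follows; the data frame is made up for illustration and only mirrors the column names from the post.

```r
# Toy data standing in for the poster's mydf (values are made up).
mydf <- data.frame(
  EMAIL_ADDRESS = c("a@example.org", NA, "not-an-email", NA),
  VALID_EMAIL   = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# na.rm is not an argument of subset(); it is silently ignored here,
# so rows with a missing EMAIL_ADDRESS are still returned.
a <- subset(mydf, VALID_EMAIL == FALSE, na.rm = TRUE, select = EMAIL_ADDRESS)

# Exclude the missing addresses explicitly in the row condition.
b <- subset(mydf, VALID_EMAIL == FALSE & !is.na(EMAIL_ADDRESS),
            select = EMAIL_ADDRESS)
print(b)
```

The condition in subset() decides which rows are kept, so missing values in another column have to be tested there with is.na().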
[R] Any recommendations for reusable profiling of name fields?
Hi,

I'm pretty new to R and am trying to develop a reusable set of scripts that I can use to profile various data types and common fields in our database. I know that what I'm asking is a can of worms, so please bear with me. :)

For example, we store a person's first name, last name, phone number, email address, last gift amount, gift date, etc. as well as integer type data. I'm wondering if there's a best practice for validating a field that holds, for example, first name or last name. A couple of things I've come up with are:

1) Count of characters (nchar) in the first (or last) name field
2) Number of unique tokens
3) Patterns (converting alpha to A and numeric to N) and count the frequency of each unique pattern that results. I suppose I could make lower case alpha 'a' and upper = 'A' to be more specific.
4) Min and max name (helps identify those with leading spaces, numbers)

Does anyone have more suggestions for techniques that are common or that you'd recommend for name fields? Ultimately, I'm looking to develop a common set of profiles for various data types, so if there's a white paper (I've googled, but not found any that hit the mark yet) I'd love to see it. Perhaps there's even a package for this type of thing?

Thanks much!
-- Jeff
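The four checks listed above can be sketched in base R roughly as follows. The function name profile_names and the sample names are hypothetical, chosen only for illustration; this is one possible starting point, not an established profiling method.

```r
# Hypothetical helper: profile a character vector of names.
profile_names <- function(x) {
  x <- as.character(x)
  # Map upper case to "A", then lower case to "a", then digits to "N";
  # spaces, apostrophes and other characters are left as they are.
  pattern <- gsub("[[:upper:]]", "A", x)
  pattern <- gsub("[[:lower:]]", "a", pattern)
  pattern <- gsub("[[:digit:]]", "N", pattern)
  list(
    nchar_summary = summary(nchar(x)),                      # 1) length of each value
    n_unique      = length(unique(x)),                      # 2) number of distinct values
    patterns      = sort(table(pattern), decreasing = TRUE), # 3) pattern frequencies
    range         = range(x, na.rm = TRUE)                  # 4) min and max value
  )
}

p <- profile_names(c("Smith", " Jones", "O'Neil", "Lee2"))
print(p$patterns)
```

Rare patterns at the bottom of the frequency table, such as one with a leading space or a trailing N, are the candidates worth inspecting by hand.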
[R] For loop on column names
I'm trying to find a more efficient way to calculate the percent a field is populated and repeat it for each field (column). First, I'm counting the number of lines:

    lines <- as.integer(countLines(extract) - 1)
    dput(lines)
    10L

    extract <- 'C:/Users/jeffjohn/Desktop/batchextract_100k_sample.csv'
    mydf <- read.csv(file = extract, header = TRUE)

Here's the list of columns in my file:

    dput(colnames(mydf))
    c("PERSONPROFILE_POS", "PARTY_ID", "PERSON_FIRST_NAME", "PERSON_LAST_NAME", "PERSON_MIDDLE_NAME", "PARTY_NUMBER", "ACCOUNT_NUMBER", "ABILITEC_LINK", "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "COUNTY", "STATE", "PROVINCE", "POSTAL_CODE", "COUNTRY", "PRIMARY_PER_TYPE", "SELLTOADDR_LOS", "LOCATION_ID", "SELLTOADDR_SOS", "PARTY_SITE_ID", "PRIMARYPHONE_CPOS", "CONTACT_POINT_ID_PCP", "CONTACT_POINT_PURPOSE_PCP", "PHONE_LINE_TYPE", "PRIMARY_FLAG_PCP", "PHONE_COUNTRY_CODE", "PHONE_AREA_CODE", "PHONE_NUMBER", "EMAIL_CPOS", "CONTACT_POINT_ID_ECP", "CONTACT_POINT_PURPOSE_ECP", "PRIMARY_FLAG_ECP", "EMAIL_ADDRESS", "BB_PARTY_ID")

I want to count the percentage populated for each field. Rather than do:

    percent(length(is.null(mydf$PERSONPROFILE_POS)) / lines)
    percent(length(is.null(mydf$PARTY_ID)) / lines)

etc. and repeat for each field manually, I want to use a for loop. I am trying the following:

    a <- length(colnames(mydf))  # this is to get the total number of columns
    for (i in 1:a) print(percent(length(is.null(a)) / lines))

which isn't correct. I'm new to programming, so I don't quite know how to deal with this. Any suggestions? Thanks much.
-- Jeff
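One thing worth noting about the attempt above: is.null() tests a whole object rather than its elements, so length(is.null(x)) is always 1. A minimal sketch of a per-column calculation follows, assuming "populated" means neither NA nor an empty string; the data frame and the helper name is_populated are made up for illustration.

```r
# Toy data frame standing in for mydf (values are made up).
mydf <- data.frame(
  PARTY_ID      = c(1, 2, 3, 4),
  EMAIL_ADDRESS = c("a@example.org", NA, "", "d@example.org"),
  CITY          = c("Seattle", "Tacoma", NA, NA),
  stringsAsFactors = FALSE
)

# Treat NA and empty strings as "not populated".
is_populated <- function(x) !is.na(x) & as.character(x) != ""

# sapply() visits each column; mean() of a logical vector is the share of TRUEs.
pct_populated <- sapply(mydf, function(col) 100 * mean(is_populated(col)))
print(round(pct_populated, 1))

# The same calculation written as an explicit for loop over column names.
for (nm in names(mydf)) {
  cat(nm, ":", round(100 * mean(is_populated(mydf[[nm]])), 1), "%\n")
}
```

Because the share is computed from the column itself, there is no need to count the lines of the file separately.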
Re: [R] Barplot not showing all labels
Sorry guys, I'm running into an issue. I have a data frame. Here is the dput output having run:

    dput(head((non_us),25), file = "C:/Users/jeffjohn/Desktop/non_us_sam.csv", control = c("keepNA", "keepInteger", "showAttributes"))

    structure(list(COUNTRY = structure(c(4L, 25L, 35L, 12L, 4L, 5L, 14L, 14L, 14L, 12L, 62L, 28L, 9L, 41L, 14L, 34L, 66L, 41L, 21L, 32L, 4L, 9L, 14L, 4L, 28L), .Label = c("AE", "AR", "AT", "AU", "BB", "BD", "BE", "BH", "BM", "BN", "BO", "BR", "BS", "CA", "CH", "CM", "CN", "CO", "CR", "CY", "DE", "DK", "DO", "EC", "ES", "FI", "FR", "GB", "GR", "GU", "HK", "ID", "IE", "IL", "IN", "IO", "IT", "JM", "JP", "KH", "KR", "KY", "LU", "LV", "MO", "MX", "MY", "NG", "NL", "NO", "NZ", "PA", "PE", "PG", "PH", "PR", "PT", "RO", "RU", "SA", "SE", "SG", "TC", "TH", "TT", "TW", "TZ", "ZA"), class = "factor")), .Names = "COUNTRY", row.names = c(329L, 1146L, 1474L, 1491L, 1585L, 1997L, 2190L, 2382L, 2442L, 2499L, 2703L, 3151L, 3278L, 3652L, 4730L, 5106L, 5214L, 5447L, 5710L, 5924L, 6185L, 6204L, 6258L, 6383L, 6811L), class = "data.frame")

This data frame is called non_us. I want to plot it so that it shows a chart of COUNTRY and the frequency of each (pretty simple I think). However, I don't know what to pass in for 'aes'. When I type names(non_us) it only shows COUNTRY. Any suggestions for what to use for X and Y (assuming both are needed)?

    ggplot(non_us, aes(x=?, y=?)) + geom_bar(stat = "identity", colour = "red") + coord_flip()

I appreciate your help VERY MUCH!
Jeff
World Vision
Re: [R] Barplot not showing all labels
Thanks John. Yes I do need to aggregate. I was thinking that ggplot would do the aggregating, but in any event, am now trying this:

    n <- data.frame(table(non_us))
    names(n) <- c("COUNTRY", "FREQ")

which then gives me:

    dput(n)
    structure(list(COUNTRY = structure(1:68, .Label = c("AE", "AR", "AT", "AU", "BB", "BD", "BE", "BH", "BM", "BN", "BO", "BR", "BS", "CA", "CH", "CM", "CN", "CO", "CR", "CY", "DE", "DK", "DO", "EC", "ES", "FI", "FR", "GB", "GR", "GU", "HK", "ID", "IE", "IL", "IN", "IO", "IT", "JM", "JP", "KH", "KR", "KY", "LU", "LV", "MO", "MX", "MY", "NG", "NL", "NO", "NZ", "PA", "PE", "PG", "PH", "PR", "PT", "RO", "RU", "SA", "SE", "SG", "TC", "TH", "TT", "TW", "TZ", "ZA"), class = "factor"), FREQ = c(3L, 2L, 1L, 31L, 4L, 1L, 1L, 1L, 45L, 1L, 1L, 4L, 5L, 86L, 3L, 1L, 8L, 1L, 2L, 1L, 8L, 2L, 1L, 2L, 4L, 2L, 4L, 35L, 3L, 3L, 14L, 3L, 5L, 2L, 5L, 1L, 2L, 1L, 15L, 1L, 11L, 2L, 2L, 1L, 1L, 23L, 7L, 1L, 6L, 1L, 3L, 1L, 2L, 1L, 1L, 8L, 1L, 1L, 1L, 1L, 1L, 18L, 1L, 1L, 2L, 11L, 1L, 3L)), .Names = c("COUNTRY", "FREQ"), row.names = c(NA, -68L), class = "data.frame")

Then I do the following thinking that it would create the proper chart:

    p <- ggplot(n, aes(x=COUNTRY, Y=FREQ)) + geom_bar() + coord_flip()
    p

However, what I get is the x axis showing 'count' with a scale of 0.00 to 1.00. So then I try to change the limit of x to be from 0 to 100:

    p <- ggplot(n, aes(x=COUNTRY, Y=FREQ)) + geom_bar() + coord_flip() + xlim(0,100)

but I get an error: Error: Discrete value supplied to continuous scale. I've tried googling that error and people talk about the data type not being right, but for me str(n) shows:

    'data.frame': 68 obs. of 2 variables:
     $ COUNTRY: Factor w/ 68 levels "AE","AR","AT",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ FREQ   : int 3 2 1 31 4 1 1 1 45 1 ...

To confirm, when attempting to plot a count of occurrences by country in a data frame with multiple possible rows per country, you have to aggregate BEFORE passing it to ggplot? I appreciate your time.

On Wed, Jan 15, 2014 at 12:58 PM, John Kane jrkrid...@inbox.com wrote:

Thanks for the dput() data.frame. It makes looking at the problem a lot easier. Basically you have a mucked-up data.frame. That is, what you see is not what you think you have. You only have one variable in the data.frame and that is the country names. For some reason the numbers are being considered as row names not as a variable. Do a str(filename) to see what is happening. You do need to have an x and y value. Try something like this:

    library(ggplot2)
    dat1$val <- rownames(dat1)  # create a new y value from the row names
    ggplot(dat1, aes(COUNTRY, val)) + geom_bar(stat = "identity", colour = "blue", fill = 'red', position = "dodge") + coord_flip()

It's not very pretty but it may give you a start. BTW I see that some countries (GB, CA, AU amongst others) have multiple entries. Does this make sense or should you aggregate before graphing?

John Kane
Kingston ON Canada
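On the aggregation question, two routes are sketched below: let geom_bar() count the raw rows, or aggregate first and plot the counts as given. Neither is confirmed in the thread. A possible reason for the 0.00 to 1.00 axis, offered as an assumption, is that aesthetic names are case-sensitive, so Y=FREQ is not picked up as y and geom_bar() falls back to counting the one row per country. The sample data is made up and ggplot2 is assumed to be installed.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Toy stand-in for non_us: one row per record, a single COUNTRY column.
non_us <- data.frame(COUNTRY = factor(c("AU", "CA", "CA", "GB", "CA", "AU")))

# Option 1: let geom_bar() count the rows; only x is mapped.
p1 <- ggplot(non_us, aes(x = COUNTRY)) +
  geom_bar(fill = "red") +
  coord_flip()

# Option 2: aggregate first, then plot the counts exactly as given.
n <- as.data.frame(table(COUNTRY = non_us$COUNTRY), responseName = "FREQ")
p2 <- ggplot(n, aes(x = COUNTRY, y = FREQ)) +   # note the lower-case y
  geom_bar(stat = "identity", fill = "red") +
  coord_flip()

print(n)
```

With coord_flip() the counts still belong to the y scale, so a limit on them would go through ylim() rather than xlim(); xlim() on the discrete COUNTRY axis is what produces the "Discrete value supplied to continuous scale" error.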
[R] Subsetting on multiple criteria (AND condition) in R
I'm running the following to get what I would expect is a subset of countries that are not equal to US AND where COUNTRY is in one of my validcountries values.

    non_us <- subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US", select = COUNTRY, na.rm=TRUE)

However, when I then do table(non_us) I get:

    table(non_us)
    non_us
       AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK DO EC ES
     0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2  1  2  4
    FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL NO NZ PA
     2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6  1  3  1
    PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
     2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3

Notice US appears as the second to last. I expected it to NOT appear. Do you know if I'm using incorrect syntax? Is the & symbol equivalent to AND (notice I have 2 criteria for subsetting)? Also, is COUNTRY != "US" valid syntax? I don't get errors, but then again I don't get what I expect back.

Thanks in advance!
-- Jeff
Re: [R] Barplot not showing all labels
Thanks John (and everyone else as well). John's example got it very close. I can tweak from here. Thanks!

On Tue, Jan 14, 2014 at 1:22 PM, John Kane jrkrid...@inbox.com wrote:

I am not sure that I got the data correctly--it is much better to supply sample data using dput(). See ?dput for more information but I think something like this will work:

    dat1 <- structure(list(cty = structure(1:70, .Label = c("AE", "AN", "AR", "AT", "AU", "BB", "BD", "BE", "BH", "BM", "BN", "BO", "BR", "BS", "CA", "CH", "CM", "CN", "CO", "CR", "CY", "DE", "DK", "DO", "EC", "ES", "FI", "FR", "GB", "GR", "GU", "HK", "ID", "IE", "IL", "IN", "IO", "IT", "JM", "JP", "KH", "KR", "KY", "LU", "LV", "MO", "MX", "MY", "NG", "NL", "NO", "NZ", "PA", "PE", "PG", "PH", "PR", "PT", "RO", "RU", "SA", "SE", "SG", "TC", "TH", "TT", "TW", "TZ", "US", "ZA"), class = "factor"), val = c(0, 3, 0, 2, 1, 31, 4, 1, 1, 1, 45, 1, 1, 4, 5, 86, 3, 1, 8, 1, 2, 1, 8, 2, 1, 2, 4, 2, 4, 35, 3, 3, 14, 3, 5, 2, 5, 1, 2, 1, 15, 1, 11, 2, 2, 1, 1, 23, 7, 1, 6, 1, 3, 1, 2, 1, 1, 8, 1, 1, 1, 1, 1, 18, 1, 1, 2, 11, 1, 0)), .Names = c("cty", "val"), row.names = c(NA, -70L), class = "data.frame")

    library(ggplot2)
    ggplot(dat1, aes(cty, val)) + geom_bar(stat = "identity", colour = "red") + coord_flip()

It will take some cleaning up using theme() but I think it supplies the essentials that you want.

John Kane
Kingston ON Canada
Re: [R] Subsetting on multiple criteria (AND condition) in R
Thanks so much Marc and those who responded. Marc's suggestion with droplevels gave me the desired result. I'm new to figuring out how to post reproducible code. I'll try using the set.seed and rnorm functions next time and hope that does the trick. Thanks everyone!

On Tue, Jan 14, 2014 at 1:05 PM, Marc Schwartz marc_schwa...@me.com wrote:

Review the Details section of ?subset, where you will find the following:

"Factors may have empty levels after subsetting; unused levels are not automatically removed. See droplevels for a way to drop all unused levels from a data frame."

Your syntax is fine and the behavior is as expected.

Regards,
Marc Schwartz
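Marc's pointer to droplevels() can be illustrated with a small made-up example. The data below is hypothetical and only mirrors the variable names used in the question.

```r
# Toy data (made up): COUNTRY is a factor with three levels.
mydf <- data.frame(COUNTRY = factor(c("US", "CA", "GB", "US", "CA")))
validcountries <- c("US", "CA", "GB")

# & combines the two conditions element by element (logical AND).
non_us <- subset(mydf, COUNTRY %in% validcountries & COUNTRY != "US",
                 select = COUNTRY)
print(table(non_us$COUNTRY))   # US is still listed, with a count of 0

# Remove factor levels that no longer occur in the subset.
non_us <- droplevels(non_us)
print(table(non_us$COUNTRY))   # only CA and GB remain
```

The rows were filtered correctly all along; only the factor's list of levels still remembered US.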
[R] Barplot not showing all labels
I have a table that consists of the following country codes and frequencies:

       AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK DO EC ES
     0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2  1  2  4
    FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL NO NZ PA
     2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6  1  3  1
    PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
     2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3

I am executing:

    non_us <- table(subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US", select = COUNTRY))
    barplot(non_us, horiz=TRUE, xlab = "Count", ylab = "Country", main = "Count of Non-US Records by Country", col = "red")

It creates the attached image (I hope images come through on email). Notice that it is not displaying all of the country codes. It shows bars for each country, but only 6 labels are appearing. Does anyone have a suggestion? I'm open to using qplot, ggplot or ggplot2 (and have tried that), but I want a bar (horizontal) chart not a column chart.

Thanks in advance.
-- Jeff
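If staying with base graphics, the missing labels are usually a matter of space: axis labels that would overlap are skipped. A minimal sketch follows, with made-up counts in place of the real table; the las and cex.names values are starting points to adjust, not recommendations from the thread.

```r
# Made-up counts standing in for the non_us table.
counts <- c(AU = 31, CA = 86, GB = 35, MX = 23, SG = 18, BM = 45, JP = 15, HK = 14)

# las = 1 draws the labels horizontally and cex.names shrinks them,
# so more of them fit next to the bars without overlapping.
mids <- barplot(counts, horiz = TRUE, las = 1, cex.names = 0.6,
                xlab = "Count", main = "Count of Non-US Records by Country",
                col = "red")
```

With around 70 countries, a taller plotting device may also be needed before every code becomes readable.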
[R] Having a problem with labels
Hi, I'm having a problem with my labels. I am reading in a data file:

    df <- read.csv(file = 'batch1extract_100k_sample.csv')

However, it's producing two sets of labels:

    labels(df)
    [[1]]
     [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21
    [22] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
    [43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
    [64] 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
    [85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99

    [[2]]
     [1] PERSONPROFILE_POS         PARTY_ID                  PERSON_FIRST_NAME
     [4] PERSON_LAST_NAME          PERSON_MIDDLE_NAME        PARTY_NUMBER
     [7] ACCOUNT_NUMBER            ABILITEC_LINK             ADDRESS1
    [10] ADDRESS2                  ADDRESS3                  ADDRESS4
    [13] CITY                      COUNTY                    STATE
    [16] PROVINCE                  POSTAL_CODE               COUNTRY
    [19] PRIMARY_PER_TYPE          SELLTOADDR_LOS            LOCATION_ID
    [22] SELLTOADDR_SOS            PARTY_SITE_ID             PRIMARYPHONE_CPOS
    [25] CONTACT_POINT_ID_PCP      CONTACT_POINT_PURPOSE_PCP PHONE_LINE_TYPE
    [28] PRIMARY_FLAG_PCP          PHONE_COUNTRY_CODE        PHONE_AREA_CODE
    [31] PHONE_NUMBER              EMAIL_CPOS                CONTACT_POINT_ID_ECP
    [34] CONTACT_POINT_PURPOSE_ECP PRIMARY_FLAG_ECP          EMAIL_ADDRESS
    [37] BB_PARTY_ID

Notice I get 2 sets of labels: the first is a list of numbers (which does not appear in my dataset) and the second holds my actual labels. I have no idea why it's returning all of the numbers in the labels command. They're definitely not there in the input file. Any suggestions? Thank you!
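For context, labels() on a data frame returns both sets of dimension names: the row names first, then the column names. The numbers are the automatic row names R assigns on import, not data from the file. A small sketch with a made-up data frame:

```r
# Made-up data frame; read.csv() assigns row names 1, 2, 3, ... in the same way.
df <- data.frame(PARTY_ID = 1:3, CITY = c("Seattle", "Tacoma", "Kent"))

print(labels(df))    # a list: [[1]] row names, [[2]] column names
print(names(df))     # just the column names
print(colnames(df))  # same as names() for a data frame
```

To get only the column names, names() or colnames() is the more direct call.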
[R] Patterns on postal codes
Hi all,

I'm pretty new to R and have a question. I have a postal_code field which can have a variety of values such as:

For US postal codes: 22942-0173 or 32601
For Canada postal codes: N9YZE6 or S7V 1J9

What I want to do is represent these as patterns, such as:

US: NNNNN-NNNN or NNNNN
Canada: ANAAAN or ANA NAN

where N = any number, A = any alpha character, space = space, etc. (other characters such as ' should be represented as '). Ultimately I want to count these to see how many have a pattern of NNNNN-NNNN, ANA NAN, etc. so that I can visualize the outliers.

Does anyone know if there is a built-in function in R to do this? Currently, the str() function on the postal_code field shows a factor with 90,993 levels, which isn't particularly helpful.

Thanks in advance!
-- Jeff
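There may be no single built-in function for this, but two gsub() calls followed by table() come close. The sketch below uses the example codes from the post plus one extra made-up value; the helper name to_pattern is hypothetical.

```r
postal_code <- c("22942-0173", "32601", "N9YZE6", "S7V 1J9", "98101")

# Letters first, then digits: replacing digits first would turn the inserted
# "N" placeholders into "A" on the second pass.
to_pattern <- function(x) {
  x <- as.character(x)            # work on the text values if x is a factor
  x <- gsub("[A-Za-z]", "A", x)
  gsub("[0-9]", "N", x)           # spaces, hyphens, apostrophes stay as they are
}

pattern_counts <- sort(table(to_pattern(postal_code)), decreasing = TRUE)
print(pattern_counts)
```

The rare patterns at the tail of pattern_counts are the outliers to inspect, and the table can be passed straight to barplot() for a visual check.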