[R] How to pass variable column name into R function

2014-12-18 Thread Jeff Johnson

I know this has been explained a few times here in different scenarios, but I 
am having a hard time digesting this. 

The following code works fine as long as it's not inside a function (see below).

df$season <- as.character(df$season)
temp <- model.matrix( ~ season - 1, data=df)
df <- cbind(df,temp)

BEFORE:

head(df[c(1,2)])
    datetime season
1 2011-01-01      1
2 2011-01-01      1
3 2011-01-01      1
4 2011-01-01      1
5 2011-01-01      1
6 2011-01-01      1

AFTER:

head(df[c(1,2,13:16)])
    datetime season season1 season2 season3 season4
1 2011-01-01      1       1       0       0       0
2 2011-01-01      1       1       0       0       0
3 2011-01-01      1       1       0       0       0
4 2011-01-01      1       1       0       0       0
5 2011-01-01      1       1       0       0       0
6 2011-01-01      1       1       0       0       0

However, when I try to wrap it in a multi-use function:

binarize <- function(data, myvar) {
  data$myvar <- as.character(data$myvar)
  temp <- model.matrix( ~ myvar - 1, data=data)
  data <- cbind(data,temp)
}

it throws an error, undoubtedly because it cannot evaluate myvar or data (or 
both?):

Error in `$<-.data.frame`(`*tmp*`, myvar, value = character(0)) : 
  replacement has 0 rows, data has 10886

I've tried experimenting with eval(substitute()) but still it's not working. My 
ideal end-state is that you start with a dataframe and a variable, have the 
function map all of the values for the selected variable into separate binary 
columns and append that to the original dataframe. Again, when it's not in a 
function it works perfectly.
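
One way this could be made to work, sketched here under the assumption that the column name is passed as a character string such as "season": data$myvar looks for a column literally named "myvar", which is NULL, so as.character() returns character(0) and the assignment fails with the "replacement has 0 rows" error. Indexing with [[ ]] and building the formula from the string avoids that. The toy data frame below is made up for illustration.

```r
# Sketch: `myvar` is assumed to be a column name passed as a string.
binarize <- function(data, myvar) {
  # [[ ]] looks the column up by the value of myvar; data$myvar would
  # look for a column literally called "myvar".
  data[[myvar]] <- as.character(data[[myvar]])
  # Build the one-sided formula ~ <column> - 1 from the string.
  f <- reformulate(myvar, intercept = FALSE)
  temp <- model.matrix(f, data = data)
  # Return the original columns plus one indicator column per level.
  cbind(data, temp)
}

# Made-up data, not the bike-sharing data from the post.
toy <- data.frame(datetime = as.Date("2011-01-01") + 0:3,
                  season = c(1L, 2L, 2L, 4L))
out <- binarize(toy, "season")
names(out)
```

Called as binarize(df, "season"), this should reproduce the season1 to season4 columns shown above. A column name that is not a syntactic R name would need backticks inside the formula, which this sketch does not handle.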

Here's the dput data if it helps to reproduce.

 dput(head(df,50)) structure(list(datetime = structure(c(14975, 14975, 14975, 
 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 
 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14975, 14976, 
 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 
 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 14976, 
 14977, 14977, 14977), class = Date), season = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 1L, 1L), holiday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
 0L), workingday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), 
weather = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), temp = c(9.84, 9.02, 9.02, 
9.84, 9.84, 9.84, 9.02, 8.2, 9.84, 13.12, 15.58, 14.76, 17.22, 18.86, 18.86, 
18.04, 17.22, 18.04, 17.22, 17.22, 16.4, 16.4, 16.4, 18.86, 18.86, 18.04, 
17.22, 18.86, 18.86, 17.22, 16.4, 16.4, 15.58, 14.76, 14.76, 14.76, 14.76, 
14.76, 13.94, 13.94, 13.94, 14.76, 13.12, 12.3, 10.66, 9.84, 9.02, 9.02, 8.2, 
6.56), atemp = c(14.395, 13.635, 13.635, 14.395, 14.395, 12.88, 13.635, 12.88, 
14.395, 17.425, 19.695, 16.665, 21.21, 22.725, 22.725, 21.97, 21.21, 21.97, 
21.21, 21.21, 20.455, 20.455, 20.455, 22.725, 22.725, 21.97, 21.21, 22.725, 
22.725, 21.21, 20.455, 20.455, 19.695, 17.425, 16.665, 16.665, 17.425, 17.425, 
16.665, 16.665, 16.665, 16.665, 14.395, 13.635, 11.365, 
10.605, 11.365, 9.85, 8.335, 6.82), humidity = c(81L, 80L, 80L, 75L, 
75L, 75L, 80L, 86L, 75L, 76L, 76L, 81L, 77L, 72L, 72L, 77L, 82L, 82L, 88L, 88L, 
87L, 87L, 94L, 88L, 88L, 94L, 100L, 94L, 94L, 77L, 76L, 71L, 76L, 81L, 71L, 
66L, 66L, 76L, 81L, 71L, 57L, 46L, 42L, 39L, 44L, 44L, 47L, 44L, 44L, 47L), 
windspeed = c(0, 0, 0, 0, 0, 6.0032,  0, 0, 0, 0, 16.9979, 19.0012, 19.0012, 
19.9995, 19.0012, 19.9995, 19.9995, 19.0012, 16.9979, 16.9979, 16.9979, 12.998, 
15.0013, 19.9995, 19.9995, 16.9979, 19.0012, 12.998, 12.998, 19.9995, 12.998, 
15.0013, 15.0013, 15.0013, 16.9979, 19.9995, 8.9981, 12.998, 11.0014, 11.0014, 
12.998, 22.0028, 30.0026, 23.9994, 22.0028, 19.9995, 11.0014, 23.9994, 27.9993, 
26.0027), casual = c(3L, 8L, 5L, 3L, 0L, 0L, 2L, 1L, 1L, 8L, 12L, 26L, 29L, 
47L, 35L, 40L, 41L, 15L, 9L, 6L, 11L, 3L, 11L, 15L, 4L, 1L, 1L, 2L, 2L, 0L, 0L, 
0L, 1L, 7L, 16L, 20L, 11L, 4L, 19L, 9L, 7L, 10L, 1L, 5L, 11L, 0L, 0L, 0L, 0L, 
0L), registered = c(13L, 32L, 27L, 10L, 1L, 1L, 0L, 2L, 7L, 6L, 24L, 30L, 55L, 
47L, 71L, 70L, 52L, 52L, 26L, 31L, 25L, 31L, 17L, 
24L, 13L, 16L, 8L, 4L, 1L, 2L, 1L, 8L, 19L, 46L, 54L, 73L, 64L, 55L, 55L, 67L, 
58L, 43L, 29L, 17L, 20L, 9L, 8L, 5L, 2L, 1L), count = c(16L, 40L, 32L, 13L, 1L, 
1L, 2L, 3L, 8L, 14L, 36L, 56L, 84L, 94L, 106L, 110L, 93L, 67L, 35L, 37L, 36L, 
34L, 28L, 39L, 17L, 17L, 9L, 6L, 3L, 2L, 1L, 8L, 20L, 53L, 70L, 93L, 75L, 59L, 
74L, 76L, 65L, 53L, 30L, 22L, 31L, 9L, 8L, 5L, 2L, 1L)), .Names = c(datetime, 
season, holiday, workingday, weather, temp, atemp, 

[R] Updating a data frame based on if condition

2014-02-18 Thread Jeff Johnson
I have a subset of data that I have identified as suspect (for example,
the first name has excessive spaces, is longer than 35 characters or has a
number).

What I want to do is update the FNAME_SUSPECT field in mydata to TRUE if
any of those conditions are met.

Here's my data:
 dput(mydata)
structure(list(PERSON_FIRST_NAME = c(1298530, JULIA, TAYLOR, CS AND
JEFF,
88, 4465891170098562, 1124211, LEWIS  MARY KAY, KARL R O S,
5466181820076010, JULI0 C, WAYNE   T., 1124211, 1124211,
ROBERT B  VIONA D, DENNIS and MARY SUE, BRIAN   JOANNE,
1124211, RONALD and  GAIL, Mike and Mary Lou, 31763006,
7, 11460735, Paul and Mary Beth, JIMMY and RUTH MARIE,
1124211, WAYNE  LU ANN, SCOTT  ANNA MARIE, 1124211,
1124211, 952714, DAVID, RHONDA and NATALIE, VIRGINIA   S,
707069, 4397836190001917, MARIA DE LA LUZ, MARIA DE LA LUZ,
G  S COMPUTERIZED GRADING, 1124211, 1124211, 1124211,
1124211, MARIA DE LA LUZ, ED AND JANICE KISHI, 1124211,
Garrett A. and Jenny E., 1124211, 1124211, Hiram T. and A. Judith,
MA DE LA LUZ, STEVE, Bev, and Caleb, MR AND MRS EVER),
FNAME_SUSPECT = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
FNAME_LENGTH = c(7L, 26L, 2L, 16L, 7L, 16L, 10L, 16L, 7L,
10L, 7L, 7L, 18L, 19L, 14L, 7L, 16L, 17L, 8L, 1L, 8L, 18L,
20L, 7L, 14L, 18L, 7L, 7L, 6L, 25L, 12L, 6L, 16L, 15L, 15L,
26L, 7L, 7L, 7L, 7L, 15L, 19L, 7L, 23L, 7L, 7L, 22L, 12L,
21L, 15L), FNAME_PATTERN = c(999, A,_AA,_AA_AAA_,
99, , 999, A___AAA,
_A_A_A, , 9_A, A___A.,
999, 999, AA_A__A_A, AA_AAA__AAA,
A___AA, 999, AA_AAA__, _AAA__AAA,
, 9, , _AAA__,
A_AAA__A,
999, A__AA_AAA, A___A, 999,
999, 99, A,_AA_AAA_AAA, ___A,
99, , A_AA_AA_AAA, A_AA_AA_AAA,
A__A__AAA, 999, 999, 999,
999, A_AA_AA_AAA, AA_AAA_AA_A, 999,
AAA_A._AAA_A_A., 999, 999,
A_A._AAA_A._AA,
AA_AA_AA_AAA, A,_AAA,_AAA_A, AA_AAA_AAA_
), FNAME_TOKEN_COUNT = c(1L, 5L, 1L, 1L, 1L, 4L, 4L, 1L,
2L, 4L, 1L, 1L, 5L, 4L, 4L, 1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L,
1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 1L, 1L, 4L, 4L, 5L, 1L, 1L,
1L, 1L, 4L, 4L, 1L, 5L, 1L, 1L, 5L, 4L, 4L, 4L)), .Names =
c(PERSON_FIRST_NAME,
FNAME_SUSPECT, FNAME_LENGTH, FNAME_PATTERN, FNAME_TOKEN_COUNT
), row.names = c(6717L, 11035L, 11626L, 14965L, 17874L, 24341L,
25582L, 25834L, 26851L, 30134L, 36385L, 45244L, 46947L, 61449L,
67564L, 71465L, 73782L, 75278L, 78977L, 79037L, 80577L, 81644L,
84427L, 86286L, 89963L, 91208L, 94054L, 99518L, 114658L, 128305L,
129082L, 137492L, 137573L, 138556L, 139489L, 148757L, 153956L,
155546L, 160533L, 162386L, 162681L, 165220L, 168063L, 173003L,
175322L, 179935L, 180991L, 181215L, 183787L, 184573L), class = data.frame)

Note I defaulted all of the FNAME_SUSPECT to FALSE. I plan to change that
later.

I've tried running this:
if(mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 | regexpr("9",
mydata$FNAME_PATTERN) > 0)
mydata$FNAME_SUSPECT <- TRUE

however I get the error:
Warning message:
In if (mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 |  :
  the condition has length > 1 and only the first element will be used

Would I be better doing this in a for loop? I had once heard that if you're
doing a for loop in R, you're doing something wrong.
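
A loop should not be needed here. if() inspects a single TRUE/FALSE, which is what the warning is about, whereas the comparisons and | already work element by element. A minimal sketch, assuming the thresholds are "more than 3 tokens" and "more than 35 characters" and using a made-up toy data frame in place of mydata:

```r
# Toy stand-in for mydata (values are made up for illustration).
toy <- data.frame(
  FNAME_TOKEN_COUNT = c(1L, 5L, 1L, 2L),
  FNAME_LENGTH      = c(7L, 26L, 40L, 10L),
  FNAME_PATTERN     = c("999", "A_AA", "AAA", "A_A"),
  stringsAsFactors  = FALSE
)

# Each comparison yields one logical per row; | combines them row by row,
# so the result can be assigned straight into the column.
toy$FNAME_SUSPECT <- toy$FNAME_TOKEN_COUNT > 3 |
  toy$FNAME_LENGTH > 35 |
  grepl("9", toy$FNAME_PATTERN, fixed = TRUE)

toy$FNAME_SUSPECT
```

Here grepl() is swapped in for regexpr() because it returns TRUE/FALSE directly; that substitution is a choice made for this sketch, not something from the original post.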
-- 
Jeff

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Updating a data frame based on if condition

2014-02-18 Thread Jeff Johnson
This is my first time with ifelse, but I've tried:

mydata$FNAME_SUSPECT <- ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE, FALSE,
 ifelse(mydata$FNAME_LENGTH > 35, TRUE, FALSE,
ifelse(regexpr("9", mydata$FNAME_PATTERN) > 0, TRUE,
FALSE
   )
  )
)

Error in ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE, FALSE,
ifelse(mydata$FNAME_LENGTH >  :
  unused argument (ifelse(mydata$FNAME_LENGTH > 35, TRUE, FALSE,
ifelse(regexpr("9", mydata$FNAME_PATTERN) > 0, TRUE, FALSE)))

I have the R for Dummies book which covers it a bit, but I just ordered the
R Cookbook.


On Tue, Feb 18, 2014 at 10:16 AM, David Carlson dcarl...@tamu.edu wrote:

 Not always true, but it is in this case:

 ?ifelse

 David C


Re: [R] Updating a data frame based on if condition

2014-02-18 Thread Jeff Johnson
Ahh, I was specifying the second argument FALSE incorrectly. Works now as:

mydata$FNAME_SUSPECT <- ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE,
  ifelse(mydata$FNAME_LENGTH > 55, TRUE,
    ifelse(regexpr("9", mydata$FNAME_PATTERN) == 0, TRUE, FALSE)
  )
)
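
One caveat on the last condition, offered as a sketch rather than a correction of the post: regexpr() returns the start position of the first match (1 or more) and -1 when there is no match, so a comparison with == 0 is never TRUE and that branch flags nothing. Testing > 0, or using grepl(), may be closer to the intent of catching names that contain a digit. The patterns below are made up.

```r
# Made-up pattern strings for illustration.
patterns <- c("999", "A_AA", "A9", "")

# regexpr() gives the match position, or -1 when "9" does not occur.
pos <- as.integer(regexpr("9", patterns))
pos

# "== 0" never fires; "> 0" and grepl() agree with each other.
never      <- pos == 0
has_digit  <- pos > 0
has_digit2 <- grepl("9", patterns, fixed = TRUE)
```

With these inputs pos is 1, -1, 2, -1, so has_digit and has_digit2 are TRUE only for the first and third entries.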




Re: [R] Updating a data frame based on if condition

2014-02-18 Thread Jeff Johnson
Thanks David, that's a great improvement.


On Tue, Feb 18, 2014 at 12:36 PM, David Carlson dcarl...@tamu.edu wrote:

 What you have can work, but it will be hard to maintain and
 debug. Easier to follow is

 cond1 <- mydata$FNAME_TOKEN_COUNT > 3
 cond2 <- mydata$FNAME_LENGTH > 55
 cond3 <- regexpr("9", mydata$FNAME_PATTERN) == 0
 mydata$FNAME_SUSPECT <- apply(cbind(cond1, cond2, cond3), 1, any)
 mydata$FNAME_SUSPECT
  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
 [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
 [25]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
 [37] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
 [49]  TRUE  TRUE

 And adding or changing a condition is pretty simple

 David C


[R] In RStudio/Win7, which directory stores the markdown.css file?

2014-02-03 Thread Jeff Johnson
I'm running Windows 7 and RStudio .98.490. I need to edit the CSS file to
test something out, but I've found multiple files and changing them seems
to do nothing. It almost seems like the CSS may be cached or something
since all changes to it do nothing. FYI, I've read the tutorials on custom
CSS as well.

Thanks in advance.

-- 
Jeff



Re: [R] Controlling font size on code chunk outputs using Knitr

2014-01-31 Thread Jeff Johnson
Yihui/Jeff,

I'm trying to determine where the default CSS file is located as I don't
see this in any of the documentation. I can definitely find a markdown.css
file in  C:\Program Files\RStudio\resources

I also see an R.css file in that directory.

I also have R.css in C:\Users\jeffjohn\Dropbox\R\Rlibs\rstudio\html which
is where I have all of my packages installed. Would you know how I can
determine what CSS file a given .Rmd file is referencing?

However, I've tried making a simple change to each of them (first backing
them up of course) by changing the h1 to small instead of x-large and
saving the doc, but when I knit the document it does not change anything.

Any guidance you can provide would be extremely helpful. Again, I'm using
R-Studio on Windows.



-- 
Jeff



Re: [R] Controlling font size on code chunk outputs using Knitr

2014-01-30 Thread Jeff Johnson
Hi Yihui,

The package I have installed is knitr. To generate the HTML, I run Knit
HTML from within R Studio version .98.490 (there's an icon to initiate it).

Here's a simple example:
showcode <- FALSE
commentchar <- NA

You can load this data as 'mydf'...
dput(mydf)
structure(list(PERSONPROFILE_POS = DV, PARTY_ID = 95252415L,
PERSON_FIRST_NAME = Julie, PERSON_LAST_NAME = herlastname,
PERSON_MIDDLE_NAME = NA_character_,
PARTY_NUMBER = 49229698L, ACCOUNT_NUMBER = 104205066L, ABILITEC_LINK =
25455695,
ADDRESS1 = 332 SE SOME RD, ADDRESS2 = NA_character_,
ADDRESS3 = NA_character_, ADDRESS4 = NA_character_, CITY = SOMECITY,
COUNTY = SOMECOUNTY, STATE = OR, PROVINCE = NA_character_,
POSTAL_CODE = 97111-, COUNTRY = US, PRIMARY_PER_TYPE = N,
SELLTOADDR_LOS = DV, LOCATION_ID = 6222438L, SELLTOADDR_SOS = DV,
PARTY_SITE_ID = 7292226L, PRIMARYPHONE_CPOS = DV,
CONTACT_POINT_ID_PCP = 62243903L,
CONTACT_POINT_PURPOSE_PCP = PERSONAL, PHONE_LINE_TYPE = GEN,
PRIMARY_FLAG_PCP = Y, PHONE_COUNTRY_CODE = NA_integer_,
PHONE_AREA_CODE = 244, PHONE_NUMBER = 244, EMAIL_CPOS = DV,
CONTACT_POINT_ID_ECP = 6202L, CONTACT_POINT_PURPOSE_ECP =
NA_character_,
PRIMARY_FLAG_ECP = Y, EMAIL_ADDRESS = someem...@yahoo.com,
BB_PARTY_ID = NA, VALID_COUNTRY = TRUE, VALID_USSTATE = TRUE,
POSTAL_PATTERN = 9-, VALID_USPP = TRUE, FULL_PHONE =
244244,
FULL_PHONE_PATTERN = NA99, FNAME_PATTERN = A,
FNAME_LENGTH = 5L, FNAME_TOKEN_COUNT = 1L, LNAME_LENGTH = 4L,
LNAME_PATTERN = , MNAME_LENGTH = 2L, MNAME_PATTERN =
NA_character_,
MNAME_TOKEN_COUNT = 1L, LNAME_TOKEN_COUNT = 1L, EMAIL_LENGTH = 19L,
VALID_EMAIL = TRUE), .Names = c(PERSONPROFILE_POS, PARTY_ID,
PERSON_FIRST_NAME, PERSON_LAST_NAME, PERSON_MIDDLE_NAME,
PARTY_NUMBER, ACCOUNT_NUMBER, ABILITEC_LINK, ADDRESS1,
ADDRESS2, ADDRESS3, ADDRESS4, CITY, COUNTY, STATE,
PROVINCE, POSTAL_CODE, COUNTRY, PRIMARY_PER_TYPE, SELLTOADDR_LOS,
LOCATION_ID, SELLTOADDR_SOS, PARTY_SITE_ID, PRIMARYPHONE_CPOS,
CONTACT_POINT_ID_PCP, CONTACT_POINT_PURPOSE_PCP, PHONE_LINE_TYPE,
PRIMARY_FLAG_PCP, PHONE_COUNTRY_CODE, PHONE_AREA_CODE,
PHONE_NUMBER, EMAIL_CPOS, CONTACT_POINT_ID_ECP,
CONTACT_POINT_PURPOSE_ECP,
PRIMARY_FLAG_ECP, EMAIL_ADDRESS, BB_PARTY_ID, VALID_COUNTRY,
VALID_USSTATE, POSTAL_PATTERN, VALID_USPP, FULL_PHONE,
FULL_PHONE_PATTERN, FNAME_PATTERN, FNAME_LENGTH, FNAME_TOKEN_COUNT,
LNAME_LENGTH, LNAME_PATTERN, MNAME_LENGTH, MNAME_PATTERN,
MNAME_TOKEN_COUNT, LNAME_TOKEN_COUNT, EMAIL_LENGTH, VALID_EMAIL
), row.names = 1L, class = data.frame)

You can load that dataset, then:
Print the column names
```{r, echo=showcode, comment=commentchar}
colnames(mydf)
```
The resulting font is a couple of points larger than I'd like. I'd like to
be able to control this either globally or at the code chunk level.

Thanks for your help with this!


On Wed, Jan 29, 2014 at 5:57 PM, Yihui Xie x...@yihui.name wrote:

 Please provide a minimal example -- are you using R Markdown or R
 HTML? Both can produce HTML output:
 http://yihui.name/knitr/demo/minimal/

 Regards,
 Yihui
 --
 Yihui Xie xieyi...@gmail.com
 Web: http://yihui.name


 On Wed, Jan 29, 2014 at 10:49 AM, Jeff Johnson mrjeffto...@gmail.com
 wrote:
  Hi there,
  I'm currently using knitr to generate an html file, however the output of
  my code is in a font size that's larger than I desire. I've been looking
  through various options for controlling the font size of the code
 results,
  such as the knitr manual, opts_chunk, and latex.
 
  The actual code itself is not being outputted as desired (I set
 echo=FALSE
  intentionally). However, I wish to make the results of executing the
 code a
  couple of font sizes smaller. I'll likely wish to have all code output
  chunks be smaller, so a global setting is fine, though I would also
  appreciate understanding how to control it at the chunk level as well.
 
  Does any one have a recommendation on how to do this? Lots of discussion
 on
  Google, but I don't see any tangible results. I'm still pretty new to R
  however.
 
  Thanks in advance.
  --
  Jeff
 




-- 
Jeff



Re: [R] Controlling font size on code chunk outputs using Knitr

2014-01-30 Thread Jeff Johnson
Thanks Yihui and Jeff.

I've retrieved the default CSS file and made a tweak to it (changing a
header 1 size just to test it) and saved it to the same local directory as
my .Rmd file using the name 'mymarkdown.css' for testing.

I've added:
options(rstudio.markdownToHTML =
  function(inputFile, outputFile) {
require(markdown)
markdownToHTML(inputFile, outputFile, stylesheet='mymarkdown.css')
  }
)

to the top of my testfile.Rmd file so that my file now looks like:

options(rstudio.markdownToHTML =
  function(inputFile, outputFile) {
require(markdown)
markdownToHTML(inputFile, outputFile, stylesheet='mymarkdown.css')
  }
)

Title
========================================================

This is an R Markdown document. Markdown is a simple formatting syntax for
authoring web pages (click the **Help** toolbar button for more details on
using R Markdown).

When you click the **Knit HTML** button a web page will be generated that
includes both content as well as the output of any embedded R code chunks
within the document. You can embed an R code chunk like this:

```{r}
summary(cars)
```

But when I knit it, it just writes the options chunk at the top of my
document. Am I supposed to add something else to get the .rmd file to
reference the css?

I'm quite new to programming and R (as if you couldn't tell!), so not sure
what additional steps I need to add.

Thanks much.
Jeff
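
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the options() call above sits in the body of the .Rmd as plain text, so it is printed instead of being run. The hook is an R option, so it would normally be set in R itself (for example in .Rprofile) before knitting. The snippet below writes a small stylesheet and passes it to markdownToHTML() directly. The CSS selectors, the 80% size and the file name testfile.md are illustrative assumptions, not something taken from the thread.

```r
# Write a small stylesheet that shrinks code and output blocks.
# The selectors and the 80% size are illustrative assumptions.
css_file <- file.path(tempdir(), "mymarkdown.css")
writeLines(c(
  "pre, code {",
  "  font-size: 80%;",
  "}"
), css_file)

# Render an already-knitted Markdown file with that stylesheet.
# 'testfile.md' is a hypothetical file name; the call is skipped if the
# markdown package or the file is missing.
md_file <- "testfile.md"
if (requireNamespace("markdown", quietly = TRUE) && file.exists(md_file)) {
  markdown::markdownToHTML(md_file, "testfile.html", stylesheet = css_file)
}
```

Starting from a copy of the default stylesheet, as described above, and editing only the pre/code rules keeps the rest of the page styling unchanged.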



On Thu, Jan 30, 2014 at 1:48 PM, Yihui Xie x...@yihui.name wrote:

 Exactly. Please see RStudio documentation:

 https://support.rstudio.com/hc/en-us/articles/200552186-Customizing-Markdown-Rendering

 Regards,
 Yihui
 --
 Yihui Xie xieyi...@gmail.com
 Web: http://yihui.name


 On Thu, Jan 30, 2014 at 10:57 AM, Jeff Newmiller
 jdnew...@dcn.davis.ca.us wrote:
  This sounds like a classic "you need to write a custom CSS file"
 problem... Which is off-topic here, so is homework for you.
 
  On January 30, 2014 8:34:32 AM PST, Jeff Johnson mrjeffto...@gmail.com
 wrote:
 Hi Yihui,
 
 The package I have installed is knitr. To generate the HTML, I run
 Knit
 HTML from within R Studio version .98.490 (there's an icon to initiate
 it.
 
 
 
 You can load that dataset, then:
 Print the column names
 ```{r, echo=showcode, comment=commentchar}
 colnames(mydf)
 ```
 The resulting font is a couple of points larger than I'd like. I'd like
 to
 be able to control this either globally or at the code chunk level.
 
 Thanks for your help with this!




-- 
Jeff



[R] Controlling font size on code chunk outputs using Knitr

2014-01-29 Thread Jeff Johnson
Hi there,
I'm currently using knitr to generate an html file, however the output of
my code is in a font size that's larger than I desire. I've been looking
through various options for controlling the font size of the code results,
such as the knitr manual, opts_chunk, and latex.

The actual code itself is not being outputted as desired (I set echo=FALSE
intentionally). However, I wish to make the results of executing the code a
couple of font sizes smaller. I'll likely wish to have all code output
chunks be smaller, so a global setting is fine, though I would also
appreciate understanding how to control it at the chunk level as well.

Does any one have a recommendation on how to do this? Lots of discussion on
Google, but I don't see any tangible results. I'm still pretty new to R
however.

Thanks in advance.
-- 
Jeff



Re: [R] Controlling font size on code chunk outputs using Knitr

2014-01-29 Thread Jeff Johnson
Thank you Yihui for responding. I'll reply with details when I get in the 
office tomorrow am. I'm using Rstudio and added the knitr package if that 
helps. I'll check details and provide an example tomorrow am. 

I appreciate your help. 

Sent from my iPhone

 On Jan 29, 2014, at 5:57 PM, Yihui Xie x...@yihui.name wrote:
 
 Please provide a minimal example -- are you using R Markdown or R
 HTML? Both can produce HTML output:
 http://yihui.name/knitr/demo/minimal/
 
 Regards,
 Yihui
 --
 Yihui Xie xieyi...@gmail.com
 Web: http://yihui.name
 
 
 On Wed, Jan 29, 2014 at 10:49 AM, Jeff Johnson mrjeffto...@gmail.com wrote:
 Hi there,
 I'm currently using knitr to generate an html file, however the output of
 my code is in a font size that's larger than I desire. I've been looking
 through various options for controlling the font size of the code results,
 such as the knitr manual, opts_chunk, and latex.
 
 The actual code itself is not being outputted as desired (I set echo=FALSE
 intentionally). However, I wish to make the results of executing the code a
 couple of font sizes smaller. I'll likely wish to have all code output
 chunks be smaller, so a global setting is fine, though I would also
 appreciate understanding how to control it at the chunk level as well.
 
 Does any one have a recommendation on how to do this? Lots of discussion on
 Google, but I don't see any tangible results. I'm still pretty new to R
 however.
 
 Thanks in advance.
 --
 Jeff
 


[R] KnitR/RMarkdown: Is there a way to not print a section of the document?

2014-01-27 Thread Jeff Johnson
I've been looking through the R documents to see if there's a way to not
output certain chunks of code. I'm trying to present a document to a team
of folks that won't necessarily be interested in the line-by-line code,
though they are interested in the charts, etc. Thus, I'd like to not output
certain chunks of code. Is there a way to suppress sections?

Thank you.

-- 
Jeff
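
One way this is commonly handled in knitr is the echo chunk option, which hides the source while keeping results and plots. The sketch below sets it globally and notes the per-chunk form in comments; it assumes the knitr package is installed and is skipped otherwise.

```r
# Assumes the knitr package is installed; skipped otherwise.
if (requireNamespace("knitr", quietly = TRUE)) {
  # Global default: hide source code in every chunk, keep results and plots.
  knitr::opts_chunk$set(echo = FALSE)
  echo_default <- knitr::opts_chunk$get("echo")
}

# Per chunk, the same option goes in the chunk header of the .Rmd file:
#   {r, echo=FALSE}     hides the code, keeps output and plots
#   {r, include=FALSE}  runs the code but shows neither code nor output
```

The global call would typically live in a first setup chunk so that it applies to every chunk after it.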



[R] subset and na.rm not really suppressing NA values

2014-01-22 Thread Jeff Johnson
I have a dataset mydf with a field EMAIL_ADDRESS. When importing, I
specified:
mydf <- read.csv(file = extract, header = TRUE, stringsAsFactors = FALSE,
na.strings = c("NA", ""))

I've also tried setting na.strings = c("NA", "", NA) but I don't know if
it's appropriate to put NA there.

I'm running
a <- subset(mydf, VALID_EMAIL == FALSE, na.rm = TRUE, select =
EMAIL_ADDRESS)
dput(head(a,5))

structure(list(EMAIL_ADDRESS = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_)), .Names = "EMAIL_ADDRESS",
row.names = c(17L,
22L, 23L, 24L, 30L), class = "data.frame")

The results show a lot of NA values on screen and in the dput statement.

I don't quite understand why it is doing that. I would have expected it to
exclude those since I had the na.rm = TRUE statement. Do you have any
suggestions?

Thanks!
-- 
Jeff
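
A likely reason, sketched here on made-up data: na.rm is not one of the documented arguments of subset(), so it has no effect on the selected column. subset() only drops rows where the condition itself is NA. Excluding missing addresses therefore needs an explicit !is.na() test in the condition. The data frame below is a hypothetical stand-in for mydf.

```r
# Hypothetical stand-in for mydf, with missing addresses.
mydf <- data.frame(
  EMAIL_ADDRESS = c("a@example.org", NA, "bad-address", NA),
  VALID_EMAIL   = c(TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Rows 2 and 3 match VALID_EMAIL == FALSE; row 2 has an NA address.
# The !is.na() term removes it; row 4 is dropped because its condition is NA.
a <- subset(mydf,
            VALID_EMAIL == FALSE & !is.na(EMAIL_ADDRESS),
            select = EMAIL_ADDRESS)
print(a)
```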



[R] Any recommendations for reusable profiling of name fields?

2014-01-17 Thread Jeff Johnson
Hi, I'm pretty new to R and am trying to develop a reusable set of scripts
that I can use to profile various data types and common fields in our
database. I know that what I'm asking is a can of worms, so please bear
with me. :)

For example, we store a person's first name, last name, phone number, email
address, last gift amount, gift date, etc. as well as integer type data.
I'm wondering if there's a best practice for validating a field that
holds, for example, first name or last name. A couple of things I've come
up with are:
1) Count of characters (nchar) in the first (or last) name field
2) Number of unique tokens
3) Patterns (converting alpha to A and numeric to N) and count the
frequency of each unique pattern that results. I suppose I could make lower
case alpha 'a' and upper = 'A' to be more specific.
4) Min and max name (helps identify those with leading spaces, numbers)

Does anyone have more suggestions for techniques that are common or that
you'd recommend for name fields? Ultimately, I'm looking to develop a
common set of profiles for various data types, so if there's a white paper
(I've googled, but not found any that hit the mark yet) I'd love to see it.

Perhaps there's even a package for this type of thing?

Thanks much!

-- 
Jeff
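
The four checks listed above can be sketched in base R as follows. This is only an illustration on made-up names: the vector first_name, the helper name_pattern() and the list layout are assumptions, not an established profiling standard.

```r
# Hypothetical sample of a first-name field.
first_name <- c("Jeff", " mary", "O'Brien", "J3ff", NA, "Jeff")

# Map lower case to 'a', upper case to 'A', digits to 'N'.
# Digits are replaced last so that the inserted 'N' is not turned into 'A'.
name_pattern <- function(x) {
  x <- gsub("[[:lower:]]", "a", x)
  x <- gsub("[[:upper:]]", "A", x)
  gsub("[[:digit:]]", "N", x)
}

profile <- list(
  n_missing      = sum(is.na(first_name)),
  n_unique       = length(unique(na.omit(first_name))),
  length_counts  = table(nchar(first_name)),
  pattern_counts = sort(table(name_pattern(first_name)), decreasing = TRUE),
  min_max        = range(first_name, na.rm = TRUE)
)
print(profile)
```

Rare entries in pattern_counts (here the ones with a leading space or a digit) are the candidates to inspect by hand.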



[R] For loop on column names

2014-01-17 Thread Jeff Johnson
I'm trying to find a more efficient way to calculate the percent a field is
populated and repeat it for each field (column).

First, I'm counting the number of lines:
lines <- as.integer(countLines(extract) - 1)
dput(lines)
10L

extract <- 'C:/Users/jeffjohn/Desktop/batchextract_100k_sample.csv'
mydf <- read.csv(file = extract, header = TRUE)

Here's the list of columns in my file:
dput(colnames(mydf))
c("PERSONPROFILE_POS", "PARTY_ID", "PERSON_FIRST_NAME", "PERSON_LAST_NAME",
"PERSON_MIDDLE_NAME", "PARTY_NUMBER", "ACCOUNT_NUMBER", "ABILITEC_LINK",
"ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "COUNTY",
"STATE", "PROVINCE", "POSTAL_CODE", "COUNTRY", "PRIMARY_PER_TYPE",
"SELLTOADDR_LOS", "LOCATION_ID", "SELLTOADDR_SOS", "PARTY_SITE_ID",
"PRIMARYPHONE_CPOS", "CONTACT_POINT_ID_PCP", "CONTACT_POINT_PURPOSE_PCP",
"PHONE_LINE_TYPE", "PRIMARY_FLAG_PCP", "PHONE_COUNTRY_CODE",
"PHONE_AREA_CODE", "PHONE_NUMBER", "EMAIL_CPOS", "CONTACT_POINT_ID_ECP",
"CONTACT_POINT_PURPOSE_ECP", "PRIMARY_FLAG_ECP", "EMAIL_ADDRESS",
"BB_PARTY_ID")

I want to count the percentage populated for each field. Rather than do:
percent(length(is.null(mydf$PERSONPROFILE_POS)) / lines)
percent(length(is.null(mydf$PARTY_ID)) / lines)
etc.
and repeat for each field manually, I want to use a for loop.

I am trying the following:
a <- length(colnames(mydf)) # this is to get the total number of columns

for (i in 1:a)
 print((percent(length(is.null(a)) / lines))

which isn't correct. I'm new to programming, so I don't quite know how to
deal with this. Any suggestions? Thanks much.
-- 
Jeff
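
A minimal sketch of one way to get the share of populated values per column, on a made-up two-column data frame. It assumes that "populated" means neither NA nor an empty string. Note that is.null() tests the whole object rather than each element, so is.na() is used instead. The loop at the end only prints the values already computed by sapply().

```r
# Hypothetical stand-in for mydf.
mydf <- data.frame(
  PARTY_ID          = c(1, 2, NA, 4),
  PERSON_FIRST_NAME = c("Ann", "", NA, "Bob"),
  stringsAsFactors  = FALSE
)

# Fraction of rows in each column that are neither NA nor "".
pct_populated <- sapply(mydf, function(col) mean(!is.na(col) & col != ""))

# Print one line per column, as a for loop over the column names.
for (col_name in names(mydf)) {
  cat(sprintf("%-20s %5.1f%%\n", col_name, 100 * pct_populated[[col_name]]))
}
```

Because mean() of a logical vector is the proportion of TRUE values, no separate row count is needed.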



[R] Any recommendations for reusable profiling of name fields?

2014-01-16 Thread Jeff Johnson
Hi, I'm pretty new to R and am trying to develop a reusable set of scripts
that I can use to profile various data types and common fields in our
database. I know that what I'm asking is a can of worms, so please bear
with me. :)

For example, we store a person's first name, last name, phone number, email
address, last gift amount, gift date, etc. as well as integer type data.
I'm wondering if there's a best practice for validating a field that
holds, for example, first name or last name. A couple of things I've come
up with are:
1) Count of characters (nchar) in the first (or last) name field
2) Number of unique tokens
3) Patterns (converting alpha to A and numeric to N) and count the
frequency of each unique pattern that results. I suppose I could make lower
case alpha 'a' and upper = 'A' to be more specific.
4) Min and max name (helps identify those with leading spaces, numbers)

Does anyone have more suggestions for techniques that are common or that
you'd recommend for name fields? Ultimately, I'm looking to develop a
common set of profiles for various data types, so if there's a white paper
(I've googled, but not found any that hit the mark yet) I'd love to see it.

Perhaps there's even a package for this type of thing?

Thanks much!

-- 
Jeff



Re: [R] Barplot not showing all labels

2014-01-15 Thread Jeff Johnson
Sorry guys, I'm running into an issue. I have a data frame. Here is the
dput output having run:
dput(head((non_us),25), file = "C:/Users/jeffjohn/Desktop/non_us_sam.csv",
control = c("keepNA", "keepInteger", "showAttributes"))

structure(list(COUNTRY = structure(c(4L, 25L, 35L, 12L, 4L, 5L,
14L, 14L, 14L, 12L, 62L, 28L, 9L, 41L, 14L, 34L, 66L, 41L, 21L,
32L, 4L, 9L, 14L, 4L, 28L), .Label = c("AE", "AR", "AT", "AU",
"BB", "BD", "BE", "BH", "BM", "BN", "BO", "BR", "BS", "CA", "CH",
"CM", "CN", "CO", "CR", "CY", "DE", "DK", "DO", "EC", "ES", "FI",
"FR", "GB", "GR", "GU", "HK", "ID", "IE", "IL", "IN", "IO", "IT",
"JM", "JP", "KH", "KR", "KY", "LU", "LV", "MO", "MX", "MY", "NG",
"NL", "NO", "NZ", "PA", "PE", "PG", "PH", "PR", "PT", "RO", "RU",
"SA", "SE", "SG", "TC", "TH", "TT", "TW", "TZ", "ZA"), class = "factor")),
.Names = "COUNTRY", row.names = c(329L,
1146L, 1474L, 1491L, 1585L, 1997L, 2190L, 2382L, 2442L, 2499L,
2703L, 3151L, 3278L, 3652L, 4730L, 5106L, 5214L, 5447L, 5710L,
5924L, 6185L, 6204L, 6258L, 6383L, 6811L), class = "data.frame")

This data frame is called non_us

I want to plot it so that it shows a chart of COUNTRY and the frequency of
each (pretty simple I think). However, I don't know what to pass in for
'aes'.

When I type names(non_us) it only shows COUNTRY

Any suggestions for what to use for X and Y (assuming both are needed)?
ggplot(non_us, aes(x=?, y=?)) + geom_bar(stat = "identity", colour = "red") +
coord_flip()

I appreciate your help VERY MUCH!

Jeff
World Vision

On Tue, Jan 14, 2014 at 3:44 PM, Jeff Johnson mrjeffto...@gmail.com wrote:

 Thanks John (and everyone else as well). John's example got it very close.
 I can tweak from here. Thanks!


 On Tue, Jan 14, 2014 at 1:22 PM, John Kane jrkrid...@inbox.com wrote:

 I am not sure that I got the data correctly--it is much better to supply
 sample data using dput(). See ?dput for more information but I think
 something like this will work

 dat1 <- structure(list(cty = structure(1:70, .Label = c("AE", "AN",
 "AR",
 "AT", "AU", "BB", "BD", "BE", "BH", "BM", "BN", "BO", "BR", "BS",
 "CA", "CH", "CM", "CN", "CO", "CR", "CY", "DE", "DK", "DO", "EC",
 "ES", "FI", "FR", "GB", "GR", "GU", "HK", "ID", "IE", "IL", "IN",
 "IO", "IT", "JM", "JP", "KH", "KR", "KY", "LU", "LV", "MO", "MX",
 "MY", "NG", "NL", "NO", "NZ", "PA", "PE", "PG", "PH", "PR", "PT",
 "RO", "RU", "SA", "SE", "SG", "TC", "TH", "TT", "TW", "TZ", "US",
 "ZA"), class = "factor"), val = c(0, 3, 0, 2, 1, 31, 4, 1, 1,
 1, 45, 1, 1, 4, 5, 86, 3, 1, 8, 1, 2, 1, 8, 2, 1, 2, 4, 2, 4,
 35, 3, 3, 14, 3, 5, 2, 5, 1, 2, 1, 15, 1, 11, 2, 2, 1, 1, 23,
 7, 1, 6, 1, 3, 1, 2, 1, 1, 8, 1, 1, 1, 1, 1, 18, 1, 1, 2, 11,
 1, 0)), .Names = c("cty", "val"), row.names = c(NA, -70L), class =
 "data.frame")

 library(ggplot2)
 ggplot(dat1, aes(cty, val)) + geom_bar(stat = "identity", colour = "red") +
   coord_flip()

 It will take some cleaning  up using theme() but I think it supplies the
 essentials that you want.

 John Kane
 Kingston ON Canada


  -Original Message-
  From: mrjeffto...@gmail.com
  Sent: Mon, 13 Jan 2014 11:15:46 -0800
  To: r-help@r-project.org
  Subject: [R] Barplot not showing all labels
 
  I have a table that consists of the following country codes and
  frequencies:
 AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK
  DO
  EC ES
   0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2
  1
   2  4
  FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL
  NO
  NZ PA
   2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6
  1
   3  1
  PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
   2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3
 
  I am executing:
  non_us <- table(subset(mydf, (COUNTRY %in% validcountries) & COUNTRY !=
  "US", select = COUNTRY))
 
  barplot(non_us,horiz=TRUE,xlab = "Count", ylab = "Country",main= "Count
  of
  Non-US Records by Country",col="red")
 
  It creates the attached image (I hope images come through on email).
  Notice
  that it is not displaying all of the country codes. It shows bars for
  each
  country, but only 6 are appearing.
 
  Does anyone have a suggestion? I'm open to using qplot, ggplot or
 ggplot2
  (and have tried that), but I want a bar (horizontal) chart not a column
  chart.
 
  Thanks in advance.
 
  --
  Jeff





 --
 Jeff




-- 
Jeff



Re: [R] Barplot not showing all labels

2014-01-15 Thread Jeff Johnson
Thanks John.

Yes I do need to aggregate. I was thinking that ggplot would do the
aggregating, but in any event, am now trying this:
n <- data.frame(table(non_us))
names(n) <- c("COUNTRY", "FREQ")
which then gives me:
dput(n)
structure(list(COUNTRY = structure(1:68, .Label = c("AE", "AR",
"AT", "AU", "BB", "BD", "BE", "BH", "BM", "BN", "BO", "BR", "BS",
"CA", "CH", "CM", "CN", "CO", "CR", "CY", "DE", "DK", "DO", "EC",
"ES", "FI", "FR", "GB", "GR", "GU", "HK", "ID", "IE", "IL", "IN",
"IO", "IT", "JM", "JP", "KH", "KR", "KY", "LU", "LV", "MO", "MX",
"MY", "NG", "NL", "NO", "NZ", "PA", "PE", "PG", "PH", "PR", "PT",
"RO", "RU", "SA", "SE", "SG", "TC", "TH", "TT", "TW", "TZ", "ZA"
), class = "factor"), FREQ = c(3L, 2L, 1L, 31L, 4L, 1L, 1L, 1L,
45L, 1L, 1L, 4L, 5L, 86L, 3L, 1L, 8L, 1L, 2L, 1L, 8L, 2L, 1L,
2L, 4L, 2L, 4L, 35L, 3L, 3L, 14L, 3L, 5L, 2L, 5L, 1L, 2L, 1L,
15L, 1L, 11L, 2L, 2L, 1L, 1L, 23L, 7L, 1L, 6L, 1L, 3L, 1L, 2L,
1L, 1L, 8L, 1L, 1L, 1L, 1L, 1L, 18L, 1L, 1L, 2L, 11L, 1L, 3L)), .Names =
c("COUNTRY",
"FREQ"), row.names = c(NA, -68L), class = "data.frame")

Then I do the following thinking that it would create the proper chart:
p <- ggplot(n, aes(x=COUNTRY, Y=FREQ)) + geom_bar() + coord_flip()
p
However, what I get is the x axis showing 'count' with a scale of 0.00 to
1.00. So then I try to change the limit of x to be from 0 to 100
p <- ggplot(n, aes(x=COUNTRY, Y=FREQ)) + geom_bar() + coord_flip()  +
xlim(0,100)
but I get an error: Error: Discrete value supplied to continuous scale.
I've tried googling that error and people talk about the data type not
being right, but for me str(n) shows
'data.frame': 68 obs. of  2 variables:
 $ COUNTRY: Factor w/ 68 levels "AE","AR","AT",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ FREQ   : int  3 2 1 31 4 1 1 1 45 1 ...

To confirm, when attempting to plot a count of occurrences by country in a
data frame with multiple possible rows per country, you have to aggregate
BEFORE passing it to ggplot?

I appreciate your time.
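
A hedged sketch of what may be going wrong in the calls above: aesthetic names are case sensitive, so Y=FREQ is not picked up as the y aesthetic, and geom_bar() without stat = "identity" counts rows instead of using FREQ. After coord_flip() the counts are still the y aesthetic, so limits would go through ylim() rather than xlim(). The small non_us data frame below is made up, and the plotting part assumes ggplot2 is installed.

```r
# Made-up stand-in for the one-column non_us data frame.
non_us <- data.frame(COUNTRY = factor(c("AU", "CA", "CA", "GB", "CA", "AU")))

# Aggregate first: one row per country with its frequency.
n <- as.data.frame(table(non_us$COUNTRY))
names(n) <- c("COUNTRY", "FREQ")

# Plot only if ggplot2 is available (an assumption about the setup).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Pre-aggregated data: lower-case y plus stat = "identity".
  p <- ggplot(n, aes(x = COUNTRY, y = FREQ)) +
    geom_bar(stat = "identity") +
    coord_flip()
  # Alternative without aggregating: geom_bar() counts the rows itself.
  p2 <- ggplot(non_us, aes(x = COUNTRY)) + geom_bar() + coord_flip()
}
```

So aggregating beforehand is one option but not a requirement; the second plot lets ggplot2 do the counting on the raw rows.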


On Wed, Jan 15, 2014 at 12:58 PM, John Kane jrkrid...@inbox.com wrote:

 Thanks for the dput() data.frame.  It makes looking at the problem a lot
 easier.

 Basically you have a mucked-up data.frame. That is, what you see is not
 what you think you have.   You only have one variable in the data.frame and
 that is the country names.

 For some reason the numbers are being considered as row names not as a
 variable.  Do a str(filename) to see what is happening.  You do need to
 have an x and y value.

 Try something like this:
 library(ggplot2)
 dat1$val <- rownames(dat1) # create a new y value from the row names
 ggplot(dat1, aes(COUNTRY, val))+
   geom_bar(stat = "identity", colour = "blue", fill = 'red', position =
 "dodge") +
   coord_flip()

 It's not very pretty but it may give you a start. BTW I see that some
 countries (GB, CA, AU amongst others) have multiple entries. Does this
 make sense or should you aggregate before graphing?

 John Kane
 Kingston ON Canada

 -Original Message-
 From: mrjeffto...@gmail.com
 Sent: Wed, 15 Jan 2014 09:20:11 -0800
 To: jrkrid...@inbox.com
 Subject: Re: [R] Barplot not showing all labels

 Sorry guys, I'm running into an issue. I have a data frame. Here is the
 dput output having run:

 dput(head((non_us),25), file = C:/Users/jeffjohn/Desktop/non_us_sam.csv,
 control = c(keepNA, keepInteger,showAttributes))

 structure(list(COUNTRY = structure(c(4L, 25L, 35L, 12L, 4L, 5L,

 14L, 14L, 14L, 12L, 62L, 28L, 9L, 41L, 14L, 34L, 66L, 41L, 21L,

 32L, 4L, 9L, 14L, 4L, 28L), .Label = c(AE, AR, AT, AU,

 BB, BD, BE, BH, BM, BN, BO, BR, BS, CA, CH,

 CM, CN, CO, CR, CY, DE, DK, DO, EC, ES, FI,

 FR, GB, GR, GU, HK, ID, IE, IL, IN, IO, IT,

 JM, JP, KH, KR, KY, LU, LV, MO, MX, MY, NG,

 NL, NO, NZ, PA, PE, PG, PH, PR, PT, RO, RU,

 SA, SE, SG, TC, TH, TT, TW, TZ, ZA), class = factor)),
 .Names = COUNTRY, row.names = c(329L,

 1146L, 1474L, 1491L, 1585L, 1997L, 2190L, 2382L, 2442L, 2499L,

 2703L, 3151L, 3278L, 3652L, 4730L, 5106L, 5214L, 5447L, 5710L,

 5924L, 6185L, 6204L, 6258L, 6383L, 6811L), class = data.frame)

 This data frame is called non_us

 I want to plot it so that it shows a chart of COUNTRY and the frequency of
 each (pretty simple I think). However, I don't know what to pass in for
 'aes'.

 When I type names(non_us) it only shows COUNTRY

 Any suggestions for what to use for X and Y (assuming both are needed)?

 ggplot(non_us, aes(x=?, y=?))+ geom_bar(stat = identity, colour = red)
 + coord_flip()

 I appreciate your help VERY MUCH!

 Jeff

 World Vision

 On Tue, Jan 14, 2014 at 3:44 PM, Jeff Johnson mrjeffto...@gmail.com
 wrote:

 Thanks John (and everyone else as well). John's example got it very close.
 I can tweak from here. Thanks!

 On Tue, Jan 14, 2014 at 1:22 PM, John Kane jrkrid...@inbox.com wrote:

 I am not sure that I got the data correctly--it is much better to
 supply sample data using dput(). See ?dput for more information but I think
 something like this will work

  dat1 / -  structure(list(cty = structure(1:70, .Label = c(AE, AN,
 AR,

 AT, AU, BB, BD, BE, BH, BM, BN, BO, BR, BS,
  CA, CH, CM, CN, CO, CR, CY, DE, DK, DO, EC,

 ES, FI

[R] Subsetting on multiple criteria (AND condition) in R

2014-01-14 Thread Jeff Johnson
I'm running the following to get what I would expect is a subset of
countries that are not equal to US AND COUNTRY is not in one of my
validcountries values.

non_us <- subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US",
select = COUNTRY, na.rm=TRUE)

however, when I then do table(non_us) I get:
 table(non_us)
non_us
   AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK DO
EC ES
 0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2  1
 2  4
FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL NO
NZ PA
 2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6  1
 3  1
PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
 2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3


Notice US appears as the second to last. I expected it to NOT appear.

Do you know if I'm using incorrect syntax? Is the & symbol equivalent to
AND (notice I have 2 criteria for subsetting)? Also, is COUNTRY != "US"
valid syntax? I don't get errors, but then again I don't get what I expect
back.

Thanks in advance!



-- 
Jeff



Re: [R] Barplot not showing all labels

2014-01-14 Thread Jeff Johnson
Thanks John (and everyone else as well). John's example got it very close.
I can tweak from here. Thanks!


On Tue, Jan 14, 2014 at 1:22 PM, John Kane jrkrid...@inbox.com wrote:

 I am not sure that I got the data correctly--it is much better to supply
 sample data using dput(). See ?dput for more information but I think
 something like this will work

 dat1 <- structure(list(cty = structure(1:70, .Label = c("AE", "AN",
 "AR",
 "AT", "AU", "BB", "BD", "BE", "BH", "BM", "BN", "BO", "BR", "BS",
 "CA", "CH", "CM", "CN", "CO", "CR", "CY", "DE", "DK", "DO", "EC",
 "ES", "FI", "FR", "GB", "GR", "GU", "HK", "ID", "IE", "IL", "IN",
 "IO", "IT", "JM", "JP", "KH", "KR", "KY", "LU", "LV", "MO", "MX",
 "MY", "NG", "NL", "NO", "NZ", "PA", "PE", "PG", "PH", "PR", "PT",
 "RO", "RU", "SA", "SE", "SG", "TC", "TH", "TT", "TW", "TZ", "US",
 "ZA"), class = "factor"), val = c(0, 3, 0, 2, 1, 31, 4, 1, 1,
 1, 45, 1, 1, 4, 5, 86, 3, 1, 8, 1, 2, 1, 8, 2, 1, 2, 4, 2, 4,
 35, 3, 3, 14, 3, 5, 2, 5, 1, 2, 1, 15, 1, 11, 2, 2, 1, 1, 23,
 7, 1, 6, 1, 3, 1, 2, 1, 1, 8, 1, 1, 1, 1, 1, 18, 1, 1, 2, 11,
 1, 0)), .Names = c("cty", "val"), row.names = c(NA, -70L), class =
 "data.frame")

 library(ggplot2)
 ggplot(dat1, aes(cty, val)) + geom_bar(stat = "identity", colour = "red") +
   coord_flip()

 It will take some cleaning  up using theme() but I think it supplies the
 essentials that you want.

 John Kane
 Kingston ON Canada


  -Original Message-
  From: mrjeffto...@gmail.com
  Sent: Mon, 13 Jan 2014 11:15:46 -0800
  To: r-help@r-project.org
  Subject: [R] Barplot not showing all labels
 
  I have a table that consists of the following country codes and
  frequencies:
 AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK
  DO
  EC ES
   0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2
  1
   2  4
  FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL
  NO
  NZ PA
   2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6
  1
   3  1
  PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
   2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3
 
  I am executing:
  non_us <- table(subset(mydf, (COUNTRY %in% validcountries) & COUNTRY !=
  "US", select = COUNTRY))
 
  barplot(non_us,horiz=TRUE,xlab = "Count", ylab = "Country",main= "Count
  of
  Non-US Records by Country",col="red")
 
  It creates the attached image (I hope images come through on email).
  Notice
  that it is not displaying all of the country codes. It shows bars for
  each
  country, but only 6 are appearing.
 
  Does anyone have a suggestion? I'm open to using qplot, ggplot or ggplot2
  (and have tried that), but I want a bar (horizontal) chart not a column
  chart.
 
  Thanks in advance.
 
  --
  Jeff





-- 
Jeff



Re: [R] Subsetting on multiple criteria (AND condition) in R

2014-01-14 Thread Jeff Johnson
Thanks so much Marc, and to those that responded. Marc's suggestion with
droplevels gave me the desired result.

I'm new to figuring out how to post reproducible code. I'll try using the
set.seed and rnorm functions next time and hope that does the trick.
 Thanks everyone!


On Tue, Jan 14, 2014 at 1:05 PM, Marc Schwartz marc_schwa...@me.com wrote:

 On Jan 14, 2014, at 1:38 PM, Jeff Johnson mrjeffto...@gmail.com wrote:

  I'm running the following to get what I would expect is a subset of
  countries that are not equal to US AND COUNTRY is not in one of my
  validcountries values.
 
  non_us <- subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US",
  select = COUNTRY, na.rm=TRUE)
 
  however, when I then do table(non_us) I get:
  table(non_us)
  non_us
AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK DO
  EC ES
  0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2  1
  2  4
  FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL
 NO
  NZ PA
  2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6  1
  3  1
  PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
  2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3
 
 
  Notice US appears as the second to last. I expected it to NOT appear.
 
  Do you know if I'm using incorrect syntax? Is the & symbol equivalent to
  AND (notice I have 2 criteria for subsetting)? Also, is COUNTRY != "US"
  valid syntax? I don't get errors, but then again I don't get what I
 expect
  back.
 
  Thanks in advance!
 
 
 
  --
  Jeff


 Review the Details section of ?subset, where you will find the following:

 Factors may have empty levels after subsetting; unused levels are not
 automatically removed. See droplevels for a way to drop all unused levels
 from a data frame.


 Your syntax is fine and the behavior is as expected.

 Regards,

 Marc Schwartz




-- 
Jeff
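
The behaviour Marc describes can be reproduced on a small made-up example: after subsetting, the factor still carries the US level with a count of zero until droplevels() is applied. The data frame and the validcountries vector below are hypothetical stand-ins.

```r
# Hypothetical data: COUNTRY is a factor with four levels.
mydf <- data.frame(COUNTRY = factor(c("US", "CA", "GB", "US", "CA", "XX")))
validcountries <- c("US", "CA", "GB")

non_us <- subset(mydf,
                 (COUNTRY %in% validcountries) & COUNTRY != "US",
                 select = COUNTRY)

# Unused levels survive the subset, so US and XX show up with count 0.
before <- table(non_us$COUNTRY)
print(before)

# droplevels() removes levels that no longer occur in the data.
non_us <- droplevels(non_us)
after <- table(non_us$COUNTRY)
print(after)
```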



[R] Barplot not showing all labels

2014-01-13 Thread Jeff Johnson
I have a table that consists of the following country codes and frequencies:
   AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK DO
EC ES
 0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2  1
 2  4
FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL NO
NZ PA
 2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6  1
 3  1
PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
 2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3

I am executing:
non_us <- table(subset(mydf, (COUNTRY %in% validcountries) & COUNTRY !=
"US", select = COUNTRY))

barplot(non_us,horiz=TRUE,xlab = "Count", ylab = "Country",main= "Count of
Non-US Records by Country",col="red")

It creates the attached image (I hope images come through on email). Notice
that it is not displaying all of the country codes. It shows bars for each
country, but only 6 are appearing.

Does anyone have a suggestion? I'm open to using qplot, ggplot or ggplot2
(and have tried that), but I want a bar (horizontal) chart not a column
chart.

Thanks in advance.

-- 
Jeff
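
A likely cause, stated as an assumption: base graphics leaves out axis labels that would overlap, which happens quickly with many bars. Drawing the labels horizontally (las = 1) and smaller (cex.names) usually lets more of them fit. The counts below are a made-up subset, and pdf(NULL) is used only so the sketch runs without opening a plot window.

```r
# Made-up subset of the country counts.
counts <- c(AE = 3, AR = 2, AT = 1, AU = 31, BB = 4, CA = 86, GB = 35)

pdf(NULL)  # null device so the example runs non-interactively
mids <- barplot(counts,
                horiz = TRUE,
                las = 1,          # labels horizontal, next to each bar
                cex.names = 0.6,  # shrink the bar labels
                xlab = "Count",
                main = "Count of Non-US Records by Country",
                col = "red")
invisible(dev.off())
```

With around 70 countries, a taller plotting device may still be needed; the value 0.6 is just a starting point to adjust.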


[R] Having a problem with labels

2014-01-09 Thread Jeff Johnson
Hi, I'm having a problem with my labels.

I am reading in a data file:
df <- read.csv(file = 'batch1extract_100k_sample.csv')

However, it's producing two sets of labels:

 labels(df)
[[1]]
 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14
15 16 17 18 19 20 21
[22] 22 23 24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42
[43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56
57 58 59 60 61 62 63
[64] 64 65 66 67 68 69 70 71 72 73 74 75 76 77
78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98
99

[[2]]
 [1] PERSONPROFILE_POS PARTY_ID
 PERSON_FIRST_NAME
 [4] PERSON_LAST_NAME  PERSON_MIDDLE_NAMEPARTY_NUMBER

 [7] ACCOUNT_NUMBERABILITEC_LINK ADDRESS1

[10] ADDRESS2  ADDRESS3  ADDRESS4

[13] CITY  COUNTYSTATE

[16] PROVINCE  POSTAL_CODE   COUNTRY

[19] PRIMARY_PER_TYPE  SELLTOADDR_LOSLOCATION_ID

[22] SELLTOADDR_SOSPARTY_SITE_ID
PRIMARYPHONE_CPOS
[25] CONTACT_POINT_ID_PCP  CONTACT_POINT_PURPOSE_PCP
PHONE_LINE_TYPE
[28] PRIMARY_FLAG_PCP  PHONE_COUNTRY_CODE
 PHONE_AREA_CODE
[31] PHONE_NUMBER  EMAIL_CPOS
 CONTACT_POINT_ID_ECP
[34] CONTACT_POINT_PURPOSE_ECP PRIMARY_FLAG_ECP
 EMAIL_ADDRESS
[37] BB_PARTY_ID


Notice I get 2 rows for the labels: the first row is a list of numbers
(which does not appear in my dataset) and the second row which are my
actual labels.

I have no idea why it's returning all of the numbers in the labels command.
They're definitely not there in the input file. Any suggestions?
Thank you!
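
For what it's worth, this looks like expected behaviour rather than a problem with the file: for a data frame, labels() appears to return both sets of dimension names, with the automatic row names first and the column names second. A small sketch on a made-up data frame, using names() to get only the column names:

```r
# Hypothetical three-row data frame.
df <- data.frame(PARTY_ID = c(101, 102, 103),
                 CITY = c("Seattle", "Tacoma", "Kent"))

lab <- labels(df)
lab[[1]]   # row names, generated automatically when the file has none
lab[[2]]   # column names

# To get only the column names:
names(df)
```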



[R] Patterns on postal codes

2014-01-07 Thread Jeff Johnson
Hi all,

I'm pretty new to R and have a question. I have a postal_code field which
can have a variety of values such as:
For US postal codes: 22942-0173 or 32601
For Canada postal codes: N9YZE6 or S7V 1J9

What I want to do is represent these as patterns, such as:
US: N- or N
Canada: ANAAAN or ANA NAN
where N = any number and A = any alpha character, space = space, etc. (other
characters such as ' should be represented as ').

Ultimately I want to count these to see how many have a pattern of
N-, ANA NAN, etc so that I can visualize the outliers.

Does anyone know if there is a built-in function in R to do this?
Currently, the str() function on the postal_code field shows a factor with
90,993 levels which isn't particularly helpful.

Thanks in advance!

-- 
Jeff
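
As far as I know there is no single built-in function for this, but two gsub() calls followed by table() come close. The sketch below maps each letter to A and each digit to N, character by character, and leaves everything else untouched. The helper name to_pattern() and the sample codes are illustrative assumptions.

```r
# Sample values taken from the examples above, plus one extra US code.
postal_code <- factor(c("22942-0173", "32601", "N9YZE6", "S7V 1J9", "98101"))

# Letters are replaced first; doing digits first would turn the
# inserted 'N' into 'A' in the second step.
to_pattern <- function(x) {
  x <- as.character(x)              # factors must be converted first
  x <- gsub("[[:alpha:]]", "A", x)
  gsub("[[:digit:]]", "N", x)
}

pattern_counts <- sort(table(to_pattern(postal_code)), decreasing = TRUE)
print(pattern_counts)
```

Sorting the table puts the common patterns first, so the outliers collect at the bottom of the output.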
