[R] Text Mining in R
Hi,

Wishing you all well. I am exploring text mining with R. Here is where I need help:

1. The starting point is a data frame

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")
df1 <- data.frame(id=1:5, words=worder1)

here in dput format:

> dput(df1)
structure(list(id = 1:5, words = structure(c(3L, 1L, 2L, 5L, 4L),
.Label = c("are these the three samples?",
"He speaks differently to you, aint it !", "I am, taking 2",
"I saved 2500 this month.", "This is distilled - my dear, now give me $3"
), class = "factor")), .Names = c("id", "words"), row.names = c(NA,
-5L), class = "data.frame")

2. The corpus rituals ...

corp1 <- Corpus(VectorSource(df1$words))
inspect(corp1)
class(corp1)
corp1 <- tm_map(corp1, removeNumbers)
corp1 <- tm_map(corp1, removePunctuation)
corp1 <- tm_map(corp1, removeWords, stopwords("english"))
corp1 <- tm_map(corp1, stripWhitespace)
class(corp1)

3. Getting to the analysis

tdm1 <- TermDocumentMatrix(corp1)
inspect(tdm1[1:5,])
dtm1 <- DocumentTermMatrix(corp1)
inspect(dtm1[1:5,])

4. Now here is the problem

If I apply a transformation that is not listed in getTransformations(), I am unable to convert the corpus to a tdm or dtm:

corp1 <- tm_map(corp1, tolower)
class(corp1)
tdm1.2 <- TermDocumentMatrix(corp1)
dtm1.2 <- DocumentTermMatrix(corp1)

The error returned is:

Error: inherits(doc, "TextDocument") is not TRUE

5. The explanations on the internet suggest either

a) corp1 <- tm_map(corp1, content_transformer(tolower))

which in my case returns the error:

Error in UseMethod("content", x) :
no applicable method for 'content' applied to an object of class "character"

b) corpus_clean <- tm_map(corp1, PlainTextDocument)

which results in the loss of all the meta data.

I will appreciate any help.

Lastly, to keep the doc ids with the R corpus, should step 2 be changed to:

corp1 <- Corpus(DataframeSource(df1))

from:

corp1 <- Corpus(VectorSource(df1$words))

Thanks /

Some of the references I explored:
http://stackoverflow.com/questions/25638503/tm-loses-the-metadata-when-applying-tm-map
http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
http://stackoverflow.com/questions/24771165/r-project-no-applicable-method-for-meta-applied-to-an-object-of-class-charact
http://stackoverflow.com/questions/25551514/termdocumentmatrix-errors-in-r
http://stackoverflow.com/questions/20699111/tm-map-error-message-in-r
http://stackoverflow.com/questions/31996891/error-in-usemethodmeta-x-no-applicable-method-for-meta-applied-to-an-ob
http://stackoverflow.com/questions/11876740/r-stemming-a-string-document-corpus

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
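A minimal sketch of one likely explanation, assuming tm 0.6 or later. Once the bare tm_map(corp1, tolower) from step 4 has been run, the documents are plain character vectors, so applying content_transformer(tolower) to that same object fails with the 'content' error. Rebuilding the corpus and using only the wrapped function should avoid both errors. VCorpus() is used here as an assumption (it is not the poster's call), the name corp is hypothetical, and the block is skipped when tm is not installed.

```r
worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # Start from a fresh corpus; do not reuse one already passed through bare tolower
  corp <- VCorpus(VectorSource(worder1))

  # Wrap base functions so each document stays a TextDocument
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, stripWhitespace)

  tdm <- TermDocumentMatrix(corp)
  dtm <- DocumentTermMatrix(corp)
}
```

On the document ids: in recent tm versions DataframeSource() is documented to expect columns named doc_id and text, so df1 would probably need renaming first. That is worth checking against the installed version.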
Re: [R] Sum of Numeric Values in a DF Column
Dear Gunter / Heiberger,

Thanks for the help. This is what I was looking for:

> ... and here is a non-dplyr solution:
>
> > z <- gsub("[^[:digit:]]"," ",dd$Lower)
> > sapply(strsplit(z," +"),function(x)sum(as.numeric(x),na.rm=TRUE))
> [1] 105 67 60 100 80

And that would explain why one could not use "unlist": a grand total was not desired, but rather a sum for each of the rows.

Br /

On Mon, Apr 18, 2016 at 10:57 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> ... and a slightly more efficient non-dplyr 1-liner:
>
> > sapply(strsplit(dd$Lower,"[^[:digit:]]"),
>          function(x)sum(as.numeric(x), na.rm=TRUE))
> [1] 105 67 60 100 80
>
> Cheers,
> Bert
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
> On Mon, Apr 18, 2016 at 10:43 AM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> > ... and here is a non-dplyr solution:
> >
> > > z <- gsub("[^[:digit:]]"," ",dd$Lower)
> > > sapply(strsplit(z," +"),function(x)sum(as.numeric(x),na.rm=TRUE))
> > [1] 105 67 60 100 80
> >
> > Cheers,
> > Bert
> >
> > On Mon, Apr 18, 2016 at 10:07 AM, Richard M. Heiberger <r...@temple.edu> wrote:
> >> ## Continuing with your data
> >>
> >> AA <- stringr::str_extract_all(dd[[2]],"[[:digit:]]+")
> >> BB <- lapply(AA, as.numeric)
> >> ## I think you are looking for one of the following two expressions
> >> sum(unlist(BB))
> >> sapply(BB, sum)
> >>
> >> On Mon, Apr 18, 2016 at 12:48 PM, Burhan ul haq <ulh...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> I request help with the following: [...]
[R] Sum of Numeric Values in a DF Column
Hi,

I request help with the following:

INPUT: A data frame where column "Lower" is a character column containing numeric values (the number of numeric values differs from row to row; mostly 2)

> dput(dd)
structure(list(State = c("Alabama", "Alaska", "Arizona", "Arkansas",
"California"), Lower = c("R 72–33", "R/Coalition 27(23 R, 4 D)–12 D, 1 Ind.",
"R 36–24", "R 64–35, 1 Ind.", "D 52–28"), Upper = c("R 26–8, 1 Ind.",
"R/Coalition 15(14 R, 1 D)–5 D", "R 18–12", "R 24–11", "D 26–14"
)), .Names = c("State", "Lower", "Upper"), row.names = c(NA,
5L), class = "data.frame")

PROBLEM: I need to extract all numeric values in each row and sum them. There are a few exceptions, such as row 2, but these can be ignored and will be fixed manually.

SOLUTION SO FAR:
str_extract_all(dd[[2]],"[[:digit:]]+") returns a list of numbers as character. I am unable to unlist it, because that mixes them all together, ...

And if I may add, is there a "dplyr" way of doing it ...

Thanks
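A base-R sketch of the per-row sum asked for above, using only regmatches() and gregexpr(). The vector name lower and the helper variable nd are hypothetical; the strings are the "Lower" column from the post, with the en dash written as an escape so the code does not depend on the file encoding.

```r
nd <- "\u2013"  # the en dash used in the posted data

lower <- c(paste0("R 72", nd, "33"),
           paste0("R/Coalition 27(23 R, 4 D)", nd, "12 D, 1 Ind."),
           paste0("R 36", nd, "24"),
           paste0("R 64", nd, "35, 1 Ind."),
           paste0("D 52", nd, "28"))

# One character vector of digit runs per row (the list structure is kept)
digits <- regmatches(lower, gregexpr("[0-9]+", lower))

# Sum inside each list element instead of unlisting everything
row_sums <- vapply(digits, function(x) sum(as.numeric(x)), numeric(1))
row_sums  # 105 67 60 100 80, the same totals reported in the replies
```

Because the result is an ordinary numeric vector of the same length as the column, the same expression could also be used inside dplyr::mutate() if a dplyr pipeline is preferred.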
Re: [R] R-help Digest, Vol 157, Issue 25
Thanks to Boris Steipe, Jim Lemon and Ivan Calandra for replying.

I messed up while copying; there is an equal number of values for each country.

@Ivan: in case the number of values did differ between countries, and we wanted to fill the missing positions with 1) NA, or 2) the average of the remaining values, how would we "impute" such data?

Thanks again /
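A small sketch of both options raised above, on made-up numbers (the list vals is hypothetical, not the scraped data). Setting length(v) <- n pads a shorter vector with NA, and the NA cells can then be replaced by the mean of the values present in the same row.

```r
# Hypothetical per-country values of unequal length
vals <- list(Armenia = c(10, 20, 30, 40), Austria = c(5, 15))

# Option 1: pad every vector with NA up to the longest length
n <- max(lengths(vals))
padded <- do.call(rbind, lapply(vals, function(v) { length(v) <- n; v }))

# Option 2: replace each NA by the mean of the observed values in its row
imputed <- padded
miss <- which(is.na(imputed), arr.ind = TRUE)
imputed[miss] <- rowMeans(padded, na.rm = TRUE)[miss[, "row"]]
imputed["Austria", ]  # 5 15 10 10
```

Mean imputation is only one choice; whether it is appropriate depends on why the values are missing.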
[R] Splitting a vector into data frame
Hi,

1. I have scraped some data from the web; a subset is shown below:

> dput(temp.data)
c("Armenia", "Armenia", "43827", "39200", "35700", "36700", "39341",
"30571", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", " 0",
"0", "0", "0", "0", "Austria", "Austria", "135417", "166200",
"144500", "147300", "163211", "162536", "155412", "133667", "134962",
"146440", "131188", "11", "10", "8", "35000")

2. The corresponding list of countries is as follows:

> dput(raw.country)
c("Armenia", "Austria", "Belarus", "Belgium", "Brazil", "Bulgaria",
"Canada", "Castile-Leon (Hiszania)", "Catalonia", "Chile", "Colombia",
"Costarica", "Croatia", "Cyprus", "Czech Republic", "Ecuador",
"Estonia", "Finland", "France", "Georgia", "Germany", "Ghana",
"Greece", "Hungary", "Indonesia", "Iran", "Ireland", "Israel",
"Italy", "Kazakhstan", "Kyrgyzstan", "Latvia", "Lithuania", "Macedonia",
"Malaysia", "Mexico", "Moldova", "Mongolia", "Netherland", "Norway",
"Pakistan", "Panama", "Paraguay", "Peru", "Poland", "Portugal",
"Puertorico", "Romania", "Russia", "Serbia", "Slovakia", "Slovenia",
"Spain", "Sweden", "Switzerland", "Tunisia", "Ukraine", "United Kingdom",
"USA", "Venezuela", "Vltava", "World Total")

3. I want to organize the data into a data frame, where each row will contain the 20 values for the corresponding country. It needs to ignore the country name, which appears twice. Something like:

Armenia "43827", "39200", "35700", "36700", "39341", "30571", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", " 0", "0", "0", "0", "0",
"Austria", "135417", "166200", "144500", "147300", "163211", "162536", "155412", "133667", "134962", "146440", "131188", "11", "10", "8", "35000"

and so on

Thanks /
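One way to sketch the reshaping described in step 3, shown on a shortened, hypothetical version of the vector (three values per country instead of twenty). Entries that match the country list are treated as row markers, every other entry is assigned to the most recent marker, and the repeated country name is dropped automatically. The names temp, countries, owner and wide are illustrative only.

```r
temp <- c("Armenia", "Armenia", "43827", "39200", " 0",
          "Austria", "Austria", "135417", "166200", "11")
countries <- c("Armenia", "Austria", "Belarus")

# TRUE where the entry is a country name rather than a value
is_name <- temp %in% countries

# For every position, the country name most recently seen
owner <- temp[is_name][cumsum(is_name)]

# Keep only the values, grouped by country in order of appearance
values <- as.numeric(trimws(temp[!is_name]))
by_country <- split(values, factor(owner[!is_name], levels = unique(owner)))

# This step assumes every country has the same number of values
stopifnot(length(unique(lengths(by_country))) == 1)

wide <- data.frame(country = names(by_country),
                   do.call(rbind, by_country),
                   row.names = NULL, stringsAsFactors = FALSE)
wide
```

The value columns get default names (X1, X2, ...), which can be replaced with year labels or similar once the real column meaning is known.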
Re: [R] Error in upgrading ggplot2
Thanks. I will try both options, 1) another mirror and 2) upgrading R, and report back in case of issues.

Br /

On Fri, Mar 4, 2016 at 10:56 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> The usual thing to try in cases like this is another mirror.
>
> Another worthwhile step is upgrading your R software to the latest... if
> only to comply with the Posting Guide.
[R] Error in upgrading ggplot2
Hi,

I was planning to use GGally, which required me to upgrade ggplot2, but despite trying multiple times I have been unable to do so.

ggplot2 downloads and installs, but when I load it, I get the following message:

> library("ggplot2", lib.loc="/usr/local/lib/R/site-library")
Error in get(method, envir = home) :
  lazy-load database '/usr/local/lib/R/site-library/ggplot2/R/ggplot2.rdb' is corrupt
In addition: Warning message:
In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘ggplot2’

The session info is as follows:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C
 [6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] scales_0.3.0 reshape2_1.4.1 dplyr_0.4.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3 assertthat_0.1 digest_0.6.8 MASS_7.3-40 R6_2.1.1 grid_3.2.2
 [7] plyr_1.8.3 gtable_0.1.2 DBI_0.3.1 magrittr_1.5 stringi_1.0-1 lazyeval_0.1.10
[13] proto_0.3-10 tools_3.2.2 stringr_1.0.0 munsell_0.4.2 parallel_3.2.2 colorspace_1.2-6

Thanks
[R] Grep Help
Hi,

# 1) I have read in a CSV file
df = read.csv(file="GiftCards - v1.csv",stringsAsFactors=FALSE)
head(df)
str(df)

# 2) converted to a tbl_df
df2 = tbl_df(df)

# 3) fixed the names to remove leading "X" character
n = names(df2)
n2 = gsub(pattern="^\\w","\\1",n)
names(df2) = n2

# 4) somehow the col names are character strings, requiring me to use quotes:
#    df2$`2006` instead of df2$2006
# ---> PROBLEM 1

# 5) I need to remove the leading $ sign followed by spaces to extract values. The problem is
#    it could be a two or three digit number. I am able to retrieve two digits correctly, but miss
#    out on the leading third digit.
df2$`2006`= gsub("^(.+)([0-9]{2,3}\\.[0-9]{2})","\\2",df2$`2006`)
# --> Problem 2

# 6) dump for the data frame
df2 <- structure(list(`2006` = structure(c(3L, 2L, 1L), .Label = c("$ 24.81",
"$ 39.16", "$ 146.20"), class = "factor"), `2007` = structure(c(3L,
2L, 1L), .Label = c("$ 26.25", "$ 41.95", "$ 156.24"), class = "factor"),
`2008` = structure(c(3L, 2L, 1L), .Label = c("$ 24.92", "$ 40.54",
"$ 147.33"), class = "factor"), `2009` = structure(c(3L, 2L, 1L),
.Label = c("$ 23.63", "$ 39.80", "$ 139.91"), class = "factor"),
`2010` = structure(c(3L, 2L, 1L), .Label = c("$ 24.78", "$ 41.48",
"$ 145.61"), class = "factor"), `2011` = structure(c(3L, 2L, 1L),
.Label = c("$ 27.80", "$ 43.23", "$ 155.43"), class = "factor"),
`2012` = structure(c(3L, 2L, 1L), .Label = c("$ 28.79", "$ 43.75",
"$ 156.86"), class = "factor"), `2013` = structure(c(3L, 2L, 1L),
.Label = c("$ 29.80", "$ 45.16", "$ 163.16"), class = "factor")),
.Names = c("2006", "2007", "2008", "2009", "2010", "2011", "2012",
"2013"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))

Thanks for the help

Br /
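A sketch addressing both problems in the post, under the assumption that every cell looks like "$ 146.20". On Problem 2: the leading (.+) is greedy, so it takes as many characters as it can while [0-9]{2,3} is still satisfied with two digits, which is why the first digit of a three-digit amount disappears. Stripping everything that is not a digit or a decimal point sidesteps the issue. On Problem 1: a name that starts with a digit is not a syntactic R name, so backticks are always needed; read.csv() adds the "X" prefix because check.names defaults to TRUE. The two-column data frame below is a cut-down, hypothetical stand-in for the posted one.

```r
# Two columns in the same shape as the posted data (factors with "$ " prefix)
df2 <- data.frame(`2006` = factor(c("$ 146.20", "$ 39.16", "$ 24.81")),
                  `2007` = factor(c("$ 156.24", "$ 41.95", "$ 26.25")),
                  check.names = FALSE)

# Remove every character that is not a digit or a dot, then convert.
# as.character() is needed because the columns are factors.
to_amount <- function(col) as.numeric(gsub("[^0-9.]", "", as.character(col)))

# Apply to all columns while keeping the data frame structure
df2[] <- lapply(df2, to_amount)

# Non-syntactic names still need backticks (or [[ "2006" ]])
df2$`2006`
df2[["2007"]]
```

If the currency symbol had to be removed while keeping the text as character, sub("^\\$\\s*", "", x) would be a narrower alternative.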
Re: [R] Text Input from a Non Delimited File
Hi,

Minor Additions:

The original file was as follows:

## ---
GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
1 10038 Carl Allwood M Sutton Ashfield Harriers 02:38:40 1 02:38:40
2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5 10187 Peter Stockdale M 02:45:26 5 02:45:25
6 10064 Jared Bethell M Harlow RC 02:46:43 6 02:46:40
7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
8 13009 Rod Harris M 02:47:47 8 02:47:45
9 10033 Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
10 10037 Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
11 10048 Pavel Toropov M 02:50:41 11 02:50:41
12 10008 Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
13 10044 Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
14 10380 Ludovic Renou M 02:53:37 14 02:53:34
15 10056 Alex Keenan M 02:53:48 15 02:53:47
## ---

Available here:
http://www.coltishalljaguars.co.uk/wp-content/uploads/2011/09/Robin-hood2011.pdf

I am able to match a single entry with the regular expression:

^(\d+),(\d+),( )(.)*(M |F )(.)*(\d{2}):(\d{2}):(\d{2})( )(\d{1,})( )(\d{2}):(\d{2}):(\d{2})

But I am unable to handle the back reference mechanism well and to put in commas to delimit the text.

I believe regular expressions pertain to R as much as they do to Sublime, but please let me know if I should be posting this to the Sublime forum.

\\Cheers

On Mon, Feb 10, 2014 at 3:48 AM, Burhan ul haq ulh...@gmail.com wrote:
> Hi,
>
> I am trying to read in a file which is not delimited by any specific
> character. Something as follows:
>
> ## ---
> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
> 1,10038, Carl Allwood M Sutton Ashfield Harriers 02:38:40 1 02:38:40
> 2,10098, Adam Holland M Votwo/USN 02:41:25 2 02:41:25
> 3,13007, Pumlani Bangani M 02:43:23 3 02:43:23
> 4,10028, Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
> 5,10187, Peter Stockdale M 02:45:26 5 02:45:25
> 6,10064, Jared Bethell M Harlow RC 02:46:43 6 02:46:40
> 7,13003, Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
> 8,13009, Rod Harris M 02:47:47 8 02:47:45
> 9,10033, Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
> 10,10037, Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
> 11,10048, Pavel Toropov M 02:50:41 11 02:50:41
> 12,10008, Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
> 13,10044, Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
> 14,10380, Ludovic Renou M 02:53:37 14 02:53:34
> 15,10056, Alex Keenan M 02:53:48 15 02:53:47
> ## ---
>
> As I failed to read it in via R or Excel, I used a text editor with regular
> expressions, Sublime to be exact. I was trying to convert it to CSV format,
> and succeeded in putting commas around the first entries, as follows:
>
> ## ---
> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
> 1,10038, Carl Allwood ,M ,Sutton Ashfield Harriers 02:38:40 1 02:38:40
> 2,10098, Adam Holland ,M ,Votwo/USN 02:41:25 2 02:41:25
> 3,13007, Pumlani Bangani ,M ,02:43:23 3 02:43:23
> 4,10028, Anthony Jackson ,M ,Sittingbourne Striders 02:44:39 4 02:44:39
> 5,10187, Peter Stockdale ,M ,02:45:26 5 02:45:25
> 6,10064, Jared Bethell ,M ,Harlow RC 02:46:43 6 02:46:40
> 7,13003, Sarah Harris ,F ,35 Long Eaton RC 02:47:47 7 02:47:44
> 8,13009, Rod Harris ,M ,02:47:47 8 02:47:45
> 9,10033, Carl Sommer ,M ,Huncote Harriers 02:47:59 9 02:47:58
> 10,10037, Peter Swaine ,M ,Charnwood AC 02:49:28 10 02:49:27
> 11,10048, Pavel Toropov ,M ,02:50:41 11 02:50:41
> 12,10008, Derek Dunne ,M ,45 Treasury Running Club 02:51:42 12 02:51:40
> 13,10044, Matthew Nutt ,M ,Scunthorpe 02:52:20 13 02:52:15
> 14,10380, Ludovic Renou ,M ,02:53:37 14 02:53:34
> 15,10056, Alex Keenan ,M ,02:53:48 15 02:53:47
> ## ---
>
> I am failing after that. I tried to search for the expression:
>
> (.)*(\d{2}:\d{2}:\d{2})( )
>
> and replace it with:
>
> \1,\2,\3,
>
> with the result:
>
> ## ---
> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
> ,02:38:40, 1 02:38:40
> ,02:41:25, 2 02:41:25
> ## ---
>
> How do I fix the regular expression here? If you examine the later entries,
> some names contain a hyphen or have three parts, so other approaches do not
> work well.
>
> Secondly, is there a better way to handle this problem? The original input
> file is in pdf format. I copied the text and made a txt file out of it.
>
> The input txt file is attached.
>
> Thanks in advance for any suggestions.
>
> \\Cheers
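The parsing can also be done directly in R instead of a text editor. The following is a sketch under stated assumptions, not a tested solution for the whole file: each record sits on one line, Gender is a standalone M or F, times are always hh:mm:ss, and Cat (when present) is a number at the start of the Cat/Club text. The pattern anchors on the fields with a fixed shape and lets the free-text fields (Name, Club) fill whatever is left; (.)* in the posted regex captures only the last character, which is why the back references came out nearly empty. All object names are hypothetical, and perl = TRUE in regexec() needs a reasonably recent R.

```r
# Three sample lines from the post plus the header; readLines("file.txt")
# would supply the real input
lines <- c(
  "GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime",
  "1 10038 Carl Allwood M Sutton Ashfield Harriers 02:38:40 1 02:38:40",
  "3 13007 Pumlani Bangani M 02:43:23 3 02:43:23",
  "7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44"
)

pat <- paste0(
  "^(\\d+)\\s+(\\d+)\\s+",        # GunPos, RaceNo
  "(.+?)\\s+([MF])\\s+",          # Name (lazy), Gender
  "(.*?)\\s*",                    # Cat and Club together, possibly empty
  "(\\d{2}:\\d{2}:\\d{2})\\s+",   # GunTime
  "(\\d+)\\s+",                   # ChipPos
  "(\\d{2}:\\d{2}:\\d{2})\\s*$"   # ChipTime
)

m <- regmatches(lines, regexec(pat, lines, perl = TRUE))
m <- m[lengths(m) > 0]                          # drop header / unmatched lines
parts <- do.call(rbind, m)[, -1, drop = FALSE]  # column 1 was the full match

rest <- parts[, 5]                  # "35 Long Eaton RC", "Harlow RC" or ""
has_cat <- grepl("^[0-9]+ ", rest)  # a leading number is taken to be Cat

results <- data.frame(
  GunPos   = as.integer(parts[, 1]),
  RaceNo   = as.integer(parts[, 2]),
  Name     = parts[, 3],
  Gender   = parts[, 4],
  Cat      = as.integer(ifelse(has_cat, sub(" .*$", "", rest), NA)),
  Club     = sub("^[0-9]+ +", "", rest),
  GunTime  = parts[, 6],
  ChipPos  = as.integer(parts[, 7]),
  ChipTime = parts[, 8],
  stringsAsFactors = FALSE
)
results
```

Lines that fail to match are silently dropped here, so comparing nrow(results) with the number of input records is a sensible check; write.csv(results, ...) would then produce the delimited file.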
Re: [R] Generate Variable Length Strings from Various Sources
Hi Rainer,

Thanks for the tip. Your suggestion works perfectly; however, in keeping with the R mantra of avoiding for loops, I propose the following alternative:

# number of strings to be created
n <- 50

# random length of each string
v.length = sample( c( 2:4), n, rep = TRUE )

# letter sources
src.1 = LETTERS[ 1:10 ]
src.2 = LETTERS[ 11:20 ]
src.3 = "z"
src.4 = c( 1, 2 )

# turn into a list
src <- list( src.1, src.2, src.3, src.4 )

# pick one source at random, then draw len characters from it
my.g = function(len,src) {
  my.s = src[[ sample( 1:4, 1 ) ]]
  tmp = sample(my.s,len,rep=TRUE)
  n1 = paste(tmp,collapse="")
  n1
} # end

sapply(v.length,my.g,src)

// Cheers.
[R] Generate Variable Length Strings from Various Sources
Hi,

I am trying to generate variable length strings from variable sources as follows:

# ---8<---
# Function to generate a string, given:
# its length (passed as len)
# and the source (passed as src)
my.f = function(len,src) {
  tmp = sample(src,len,rep=FALSE)
  n1 = paste(tmp,collapse="")
  n1
} # end

# count
n=50

# length of names, a variable indicating string length
v.length = sample(c(2,3,4),n,rep=TRUE)

# letter sources
src.1 = LETTERS[1:10]
src.2 = LETTERS[11:20]
src.3 = "z"
src.4 = c(1,2)

# Issue
#s.ind = sample(c(src.1,src.2),n,rep=TRUE)
s.ind = sample(c(src.1,src.3,src.4),n,rep=TRUE)

# Generate n strings, whose length is given by v.length, and randomly using sources (src1 to 4)
unlist(lapply(v.length,my.f,s.ind))
# ---8<---

ISSUE - Details:
How do I randomly pass one source, either source 1, 2, 3 or 4? I have tried with and without the quotes, but it does not work. Without quotes it runs, but then letters are chosen from a randomized mix of all sources, such as A from src.1 and z from src.3, whereas I want only one source at a time for a name.

# Result with quotes:
> dput(r1)
c("src.4src.1src.4", "src.1src.4src.4", "src.4src.3", "src.4src.3src.4",
"src.4src.4", "src.1src.4src.4", "src.1src.1src.4src.3", "src.1src.1src.1src.4",
"src.4src.1src.4src.3", "src.1src.4src.4", "src.3src.1src.4", "src.4src.3src.1",
"src.1src.3src.1src.3", "src.4src.1src.1src.1", "src.4src.3src.4", "src.3src.3src.4",
"src.1src.3src.1src.1", "src.3src.3src.1src.4", "src.1src.1src.3", "src.3src.4src.3",
"src.3src.4src.3", "src.4src.1src.4src.3", "src.1src.3src.4src.3", "src.4src.1",
"src.1src.3src.4", "src.3src.4src.3", "src.4src.3", "src.3src.3", "src.3src.4",
"src.4src.4", "src.1src.4src.1src.4", "src.1src.4src.1", "src.3src.3",
"src.3src.1src.4", "src.1src.3src.1src.3", "src.3src.4src.1", "src.4src.3src.1",
"src.1src.4src.1src.4", "src.3src.4src.1src.4", "src.1src.3src.4src.3",
"src.4src.4src.3", "src.4src.1src.3src.1", "src.3src.3", "src.1src.4src.4",
"src.4src.1src.4", "src.3src.3", "src.1src.1", "src.3src.1src.1", "src.1src.3",
"src.3src.4src.4src.3")

# Result without quotes:
> dput(r1)
c("IGC", "B1I", "BB", "G1C", "AE", "GBE", "2DJA", "CIAG", "IGE1", "G22",
"EFD", "DGI", "BFzB", "1FI1", "JFH", "EJA", "IEzF", "FJGB", "I2z", "IFC",
"FFE", "IzJE", "FJ1I", "BI", "FJG", "EJB", "GF", "AD", "IJ", "IE", "BCGA",
"G1F", "FF", "GBB", "FGCJ", "1ID", "FzA", "GJ12", "FC2G", "FCJ2", "zIJ",
"GHFB", "AI", "EFB", "2GI", "FF", "22", "EI1", "EG", "FC21")

Thanks in advance.

/ Cheers
Re: [R] Relative Cumulative Frequency of Event Occurence
Hi Arun,

Thanks a lot. It works perfectly. Here is the complete code, for all those interested in seeing the relative cumulative frequency oscillate towards the expected value:

# Bernoulli trial where:
v.fly = c("G","B")                  # Outcome is Green or Blue fly
n = 100                             # No of Events / Trials
v.smp = seq(1:n)                    # Event Id
v.fst = sample(v.fly,n,rep=T)       # Simulating First Draw
v.sec = sample(v.fly,n,rep=T)       # Simulating Second Draw
df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec)   # Clumping in a DF
df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE))      # Event occurs if color is the same in both draws
df.1$Rel.Freq = with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1))))   # Relative Frequency
df.1$Rel.Freq = round(df.1$Rel.Freq,2)

ggplot(df.1, aes(x=sample,y=Rel.Freq)) +
  geom_line(col="green",size=2) +
  geom_abline(intercept=0.5,slope=0) +
  geom_point(col="blue") +
  labs(x="Sample No",y="Relative Cum Freq",title="Rel Cum Freq approaching 0.5 Value") +
  annotate("text",x=60,y=0.53,label="Probability of 0.5")

Cheers !

On Thu, Nov 28, 2013 at 9:40 PM, arun smartpink...@yahoo.com wrote:
> HI,
> From the dput() version of df.1, it looks like you want:
>
> cumsum(df.1[,4]=="Yes")/seq_len(nrow(df.1))
> [1] 0.000 0.500 0.333 0.250 0.400 0.333 0.4285714
> [8] 0.500 0.444 0.500
>
> A.K.
>
> On Thursday, November 28, 2013 11:26 AM, Burhan ul haq ulh...@gmail.com wrote:
> > Hi,
> >
> > My objective is to calculate the Relative (Cumulative) Frequency of Event
> > Occurrence, something as follows:
> >
> > Sample.Number 1st.Fly 2nd.Fly Did.E.occur? Relative.Cum.Frequency.of.E
> >  1 G B No  0.000
> >  2 B B Yes 0.500
> >  3 B G No  0.333
> >  4 G B No  0.250
> >  5 G G Yes 0.400
> >  6 G B No  0.333
> >  7 B B Yes 0.429
> >  8 G G Yes 0.500
> >  9 G B No  0.444
> > 10 B B Yes 0.500
> >
> > Please refer to the code below:
> >
> > ##
> > # 1.
> > v.fly=c("G","B") # Outcome is Green or Blue fly
> > # 2.
> > n=10 # No of Events / Trials
> > # 3.
> > v.smp = seq(1:n) # Event Id
> > # 4.
> > v.fst = sample(v.fly,n,rep=T) # Simulating First Draw
> > # 5.
> > v.sec = sample(v.fly,n,rep=T) # Simulating Second Draw
> > # 6.
> > df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF
> > # 7.
> > df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE)) # Event occurs if color is the same in both draws
> > # 8.
> > df.1$Rel.Freq = with(df.1, cumsum(E.occur)/(E.Occur)) # Relative Frequency
> > # This line does NOT work; the denominator part needs fixing
> > ##
> >
> > The problem is with #8, specifically the part:
> > cumsum(E.occur)/(E.Occur)
> >
> > The denominator E.Occur is a fixed value instead of a moving count. I have
> > tried nrow() and length(), but neither provides a moving version of the row
> > count, as cumsum does for the TRUE values occurring so far.
> >
> > dput(df.1)
> > structure(list(Sample.Number = 1:10, X1st.Fly = c("G", "B", "B", "G",
> > "G", "G", "B", "G", "G", "B"), X2nd.Fly = c("B", "B", "G", "B", "G",
> > "B", "B", "G", "B", "B"), Did.E.occur. = c("No", "Yes", "No", "No",
> > "Yes", "No", "Yes", "Yes", "No", "Yes"), Relative.Cum.Frequency.of.E =
> > c(0, 0.5, 0.333, 0.25, 0.4, 0.333, 0.429, 0.5, 0.444, 0.5)), .Names =
> > c("Sample.Number", "X1st.Fly", "X2nd.Fly", "Did.E.occur.",
> > "Relative.Cum.Frequency.of.E"), class = "data.frame", row.names = c(NA, -10L))
> >
> > Cheers !
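The key step in this thread is the moving denominator: cumsum() counts the occurrences so far and seq_along() (or seq_len(nrow(...))) counts the trials so far. A compact base-R sketch follows; the variable names and the seed are illustrative only, and the second part recomputes the ten-row table from the original question as a check.

```r
set.seed(1)  # only to make the simulation repeatable
fly <- c("G", "B")
n <- 100

first  <- sample(fly, n, replace = TRUE)
second <- sample(fly, n, replace = TRUE)
event  <- first == second            # TRUE when both draws have the same colour

# Occurrences so far divided by trials so far
rel_freq <- cumsum(event) / seq_along(event)

# Check against the ten trials listed in the original question
e10 <- c("No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes") == "Yes"
check10 <- round(cumsum(e10) / seq_along(e10), 3)
check10  # 0.000 0.500 0.333 0.250 0.400 0.333 0.429 0.500 0.444 0.500
```

The last element of rel_freq is simply mean(event), which is the usual overall relative frequency.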
[R] Help with Cast Function
Hi,

This is the input data frame:

###
df.1 = read.table(header=T,text="
   id gender WMC_alcohol WMC_caffeine WMC_no.drug RT_alcohol RT_caffeine RT_no.drug
1   1 female 3.7 3.7 3.9 488 236 371
2   2 female 6.4 7.3 7.9 607 376 349
3   3 female 4.6 7.4 7.3 643 226 412
4   4 male   6.4 7.8 8.2 684 206 252
5   5 female 4.9 5.2 7.0 593 262 439
6   6 male   5.4 6.6 7.2 492 230 464
7   7 male   7.9 7.9 8.9 690 259 327
8   8 male   4.1 5.9 4.5 486 230 305
9   9 female 5.2 6.2 7.2 686 273 327
10 10 female 6.2 7.4 7.8 645 240 498
")
###

This is the desired output:

###
   id gender drug     WMC RT
1   1 female alcohol  3.7 488
2   2 female alcohol  6.4 607
3   3 female alcohol  4.6 643
4   4 male   alcohol  6.4 684
5   5 female alcohol  4.9 593
6   6 male   alcohol  5.4 492
7   7 male   alcohol  7.9 690
8   8 male   alcohol  4.1 486
9   9 female alcohol  5.2 686
10 10 female alcohol  6.2 645
11  1 female caffeine 3.7 236
12  2 female caffeine 7.3 376
###

I know some melt and cast magic is required, but I was unable to sort it out myself.

Here are the dput versions:

Input Data Frame
###
dput(df.1)
structure(list(id = 1:10, gender = structure(c(1L, 1L, 1L, 2L, 1L, 2L,
2L, 2L, 1L, 1L), .Label = c("female", "male"), class = "factor"),
WMC_alcohol = c(3.7, 6.4, 4.6, 6.4, 4.9, 5.4, 7.9, 4.1, 5.2, 6.2),
WMC_caffeine = c(3.7, 7.3, 7.4, 7.8, 5.2, 6.6, 7.9, 5.9, 6.2, 7.4),
WMC_no.drug = c(3.9, 7.9, 7.3, 8.2, 7, 7.2, 8.9, 4.5, 7.2, 7.8),
RT_alcohol = c(488L, 607L, 643L, 684L, 593L, 492L, 690L, 486L, 686L, 645L),
RT_caffeine = c(236L, 376L, 226L, 206L, 262L, 230L, 259L, 230L, 273L, 240L),
RT_no.drug = c(371L, 349L, 412L, 252L, 439L, 464L, 327L, 305L, 327L, 498L)),
.Names = c("id", "gender", "WMC_alcohol", "WMC_caffeine", "WMC_no.drug",
"RT_alcohol", "RT_caffeine", "RT_no.drug"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Output Data Frame
###
dput(df.output)
structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L),
gender = structure(c(1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
.Label = c("female", "male"), class = "factor"),
drug = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
.Label = c("alcohol", "caffeine"), class = "factor"),
WMC = c(3.7, 6.4, 4.6, 6.4, 4.9, 5.4, 7.9, 4.1, 5.2, 6.2, 3.7, 7.3),
RT = c(488L, 607L, 643L, 684L, 593L, 492L, 690L, 486L, 686L, 645L, 236L, 376L)),
.Names = c("id", "gender", "drug", "WMC", "RT"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Cheers !
Re: [R] Help with Cast Function
Hi, First, a big thanks to all those who replied. I am including all the replies in one email for easier reference later: # Input from David # reshape(df.1, idvar=1:2, sep=_, direction=long, varying=names(df.1)[3:8]) # # Input from Dennis # dfr1 - reshape(df.1, idvar = c(id, gender), v.names = c(WMC, RT), timevar = type, times = c(alcohol, caffeine, no.drug), varying = list(3:5, 6:8), direction = long) rownames(dfr1) - NULL dfr # # Input from Arun # library(reshape2) library(plyr) join_all(lapply(c(WMC,RT),function(x) transform(melt(df.1[,c(1:2,grep(x,names(df.1)))],id.vars=c(id,gender),var=drug),drug=gsub(.*\\_,,drug))),by=c (id,gender,drug)) # Cheers ! On Sat, Nov 30, 2013 at 1:20 AM, David Winsemius dwinsem...@comcast.netwrote: On Nov 29, 2013, at 9:42 AM, Burhan ul haq wrote: Hi, This is the input data frame: ### df.1 = read.table(header=T,text= id gender WMC_alcohol WMC_caffeine WMC_no.drug RT_alcohol RT_caffeine RT_no.drug 1 1 female 3.7 3.7 3.9 488 236 371 2 2 female 6.4 7.3 7.9 607 376 349 3 3 female 4.6 7.4 7.3 643 226 412 4 4 male 6.4 7.8 8.2 684 206 252 5 5 female 4.9 5.2 7.0 593 262 439 6 6 male 5.4 6.6 7.2 492 230 464 7 7 male 7.9 7.9 8.9 690 259 327 8 8 male 4.1 5.9 4.5 486 230 305 9 9 female 5.2 6.2 7.2 686 273 327 10 10 female 6.2 7.4 7.8 645 240 498 ) ### This is the desired output: ### id gender drug WMC RT 1 1 female alcohol 3.7 488 2 2 female alcohol 6.4 607 3 3 female alcohol 4.6 643 4 4 male alcohol 6.4 684 5 5 female alcohol 4.9 593 6 6 male alcohol 5.4 492 7 7 male alcohol 7.9 690 8 8 male alcohol 4.1 486 9 9 female alcohol 5.2 686 10 10 female alcohol 6.2 645 11 1 female caffeine 3.7 236 12 2 female caffeine 7.3 376 ### I know some melt and cast magic is required. But I was unable to sort it myself. 
# this is base::reshape
reshape(df.1, idvar=1:2, sep="_", direction="long", varying=names(df.1)[3:8])

Here are the dput versions:

Input Data Frame
###
dput(df.1)
structure(list(id = 1:10, gender = structure(c(1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L), .Label = c("female", "male"), class = "factor"),
WMC_alcohol = c(3.7, 6.4, 4.6, 6.4, 4.9, 5.4, 7.9, 4.1, 5.2, 6.2),
WMC_caffeine = c(3.7, 7.3, 7.4, 7.8, 5.2, 6.6, 7.9, 5.9, 6.2, 7.4),
WMC_no.drug = c(3.9, 7.9, 7.3, 8.2, 7, 7.2, 8.9, 4.5, 7.2, 7.8),
RT_alcohol = c(488L, 607L, 643L, 684L, 593L, 492L, 690L, 486L, 686L, 645L),
RT_caffeine = c(236L, 376L, 226L, 206L, 262L, 230L, 259L, 230L, 273L, 240L),
RT_no.drug = c(371L, 349L, 412L, 252L, 439L, 464L, 327L, 305L, 327L, 498L)),
.Names = c("id", "gender", "WMC_alcohol", "WMC_caffeine", "WMC_no.drug", "RT_alcohol", "RT_caffeine", "RT_no.drug"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Output Data Frame
###
dput(df.output)
structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L),
gender = structure(c(1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("female", "male"), class = "factor"),
drug = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("alcohol", "caffeine"), class = "factor"),
WMC = c(3.7, 6.4, 4.6, 6.4, 4.9, 5.4, 7.9, 4.1, 5.2, 6.2, 3.7, 7.3),
RT = c(488L, 607L, 643L, 684L, 593L, 492L, 690L, 486L, 686L, 645L, 236L, 376L)),
.Names = c("id", "gender", "drug", "WMC", "RT"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Cheers !
David Winsemius
Alameda, CA, USA
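For anyone who wants to try the base::reshape suggestion from this thread without loading extra packages, here is a minimal sketch. It uses only the first three rows of df.1 as posted; the object names `wide` and `long` and the column name "drug" passed to `timevar` are my own choices for illustration, not something taken from the replies.

```r
# First three rows of df.1 from the thread, in wide format.
wide <- data.frame(
  id           = 1:3,
  gender       = c("female", "female", "female"),
  WMC_alcohol  = c(3.7, 6.4, 4.6),
  WMC_caffeine = c(3.7, 7.3, 7.4),
  WMC_no.drug  = c(3.9, 7.9, 7.3),
  RT_alcohol   = c(488L, 607L, 643L),
  RT_caffeine  = c(236L, 376L, 226L),
  RT_no.drug   = c(371L, 349L, 412L)
)

# Split each measure column name at "_" into a measure (WMC, RT)
# and a level (alcohol, caffeine, no.drug), then stack the levels.
long <- reshape(wide,
                direction = "long",
                idvar     = c("id", "gender"),
                varying   = names(wide)[3:8],
                sep       = "_",
                timevar   = "drug")   # assumed name for the new column
rownames(long) <- NULL
print(long)
```

The rows come out grouped by drug (all alcohol rows first, then caffeine, then no.drug), which appears to match the ordering in the desired output posted above.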
Re: [R] Relative Cumulative Frequency of Event Occurrence
Hi Arun,

Thanks again. Comment noted :) Amazing use of regular expressions in your solutions. Is there any reference or book you would recommend?

Cheers !

On Fri, Nov 29, 2013 at 10:56 PM, arun smartpink...@yahoo.com wrote:

Hi Burhan,
No problem. One suggestion in this code would be:
with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1))))
## ==TRUE is not needed
identical(with(df.1, cumsum(E.Occur)/(seq_len(nrow(df.1)))), with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1)))))
is.logical(TRUE)
#[1] TRUE
is.logical("Yes")
#[1] FALSE
A.K.

On Friday, November 29, 2013 12:36 PM, Burhan ul haq ulh...@gmail.com wrote:

Hi Arun,
Thanks a lot. It works perfectly. Here is the complete code, for all those who are interested to see the Rel Cum Freq oscillating to reach the Expected Value:

# Bernoulli Trial where:
library(ggplot2)
v.fly=c("G","B") # Outcome is Green or Blue fly
n=100 # No of Events / Trials
v.smp = seq(1:n) # Event Id
v.fst = sample(v.fly,n,rep=T) # Simulating First Draw
v.sec = sample(v.fly,n,rep=T) # Simulating Second Draw
df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF
df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE)) # Event Occurs, if color is same in both the draws
df.1$Rel.Freq = with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1)))) # Relative Frequency
df.1$Rel.Freq = round(df.1$Rel.Freq,2)
ggplot(df.1, aes(x=sample,y=Rel.Freq))+geom_line(col="green",size=2)+geom_abline(intercept=0.5,slope=0)+geom_point(col="blue")+labs(x="Sample No",y="Relative Cum Freq",title="Rel Cum Freq approaching 0.5 Value") + annotate("text",x=60,y=0.53,label="Probability of 0.5")

Cheers !

On Thu, Nov 28, 2013 at 9:40 PM, arun smartpink...@yahoo.com wrote:

HI,
From the dput() version of df.1, it looks like you want:
cumsum(df.1[,4]=="Yes")/seq_len(nrow(df.1))
[1] 0.0000000 0.5000000 0.3333333 0.2500000 0.4000000 0.3333333 0.4285714
[8] 0.5000000 0.4444444 0.5000000
A.K.
On Thursday, November 28, 2013 11:26 AM, Burhan ul haq ulh...@gmail.com wrote:

Hi,

My objective is to calculate Relative (Cumulative) Frequency of Event Occurrence - something as follows:

Sample.Number  1st.Fly  2nd.Fly  Did.E.occur?  Relative.Cum.Frequency.of.E
1              G        B        No            0.000
2              B        B        Yes           0.500
3              B        G        No            0.333
4              G        B        No            0.250
5              G        G        Yes           0.400
6              G        B        No            0.333
7              B        B        Yes           0.429
8              G        G        Yes           0.500
9              G        B        No            0.444
10             B        B        Yes           0.500

Please refer to the code below:

##
# 1. v.fly=c("G","B") # Outcome is Green or Blue fly
# 2. n=10 # No of Events / Trials
# 3. v.smp = seq(1:n) # Event Id
# 4. v.fst = sample(v.fly,n,rep=T) # Simulating First Draw
# 5. v.sec = sample(v.fly,n,rep=T) # Simulating Second Draw
# 6. df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF
# 7. df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE)) # Event Occurs, if color is same in both the draws
# 8. df.1$Rel.Freq = with(df.1, cumsum(E.occur)/(E.Occur)) # Relative Frequency. This line does NOT work; the denominator part needs fixing
##

Problem is with #8, specifically the part:
cumsum(E.occur)/(E.Occur)

The denominator E.Occur is a fixed value, instead of a moving count. I have tried nrow() and length(), but neither provides a moving version of the row count, as cumsum does for the TRUE values occurring so far.

dput(df.1)
structure(list(Sample.Number = 1:10,
X1st.Fly = c("G", "B", "B", "G", "G", "G", "B", "G", "G", "B"),
X2nd.Fly = c("B", "B", "G", "B", "G", "B", "B", "G", "B", "B"),
Did.E.occur. = c("No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"),
Relative.Cum.Frequency.of.E = c(0, 0.5, 0.333, 0.25, 0.4, 0.333, 0.429, 0.5, 0.444, 0.5)),
.Names = c("Sample.Number", "X1st.Fly", "X2nd.Fly", "Did.E.occur.", "Relative.Cum.Frequency.of.E"),
class = "data.frame", row.names = c(NA, -10L))

Cheers !
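The simulation discussed in this thread can be condensed into a few lines. The sketch below is only an illustration under my own assumptions: the seed, the number of trials and the variable names (`fst`, `sec`, `same`, `rel_freq`) are arbitrary choices, and the plotting step is left out.

```r
set.seed(1)   # arbitrary seed, used only so the run is repeatable
n <- 1000     # number of trials (assumed; the thread used 10 and 100)

# Two independent draws per trial, each Green or Blue.
fst <- sample(c("G", "B"), n, replace = TRUE)
sec <- sample(c("G", "B"), n, replace = TRUE)

# The event occurs when both draws show the same colour.
same <- fst == sec   # already logical, so no ifelse() or == TRUE needed

# Running relative frequency: events so far divided by trials so far.
rel_freq <- cumsum(same) / seq_len(n)

tail(rel_freq, 1)    # final value equals mean(same)
```

With a fair draw the event has probability 0.5, so the running frequency should settle near that value as n grows, although any single run will wander around it.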
[R] Relative Cumulative Frequency of Event Occurrence
Hi,

My objective is to calculate Relative (Cumulative) Frequency of Event Occurrence - something as follows:

Sample.Number  1st.Fly  2nd.Fly  Did.E.occur?  Relative.Cum.Frequency.of.E
1              G        B        No            0.000
2              B        B        Yes           0.500
3              B        G        No            0.333
4              G        B        No            0.250
5              G        G        Yes           0.400
6              G        B        No            0.333
7              B        B        Yes           0.429
8              G        G        Yes           0.500
9              G        B        No            0.444
10             B        B        Yes           0.500

Please refer to the code below:

##
# 1. v.fly=c("G","B") # Outcome is Green or Blue fly
# 2. n=10 # No of Events / Trials
# 3. v.smp = seq(1:n) # Event Id
# 4. v.fst = sample(v.fly,n,rep=T) # Simulating First Draw
# 5. v.sec = sample(v.fly,n,rep=T) # Simulating Second Draw
# 6. df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF
# 7. df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE)) # Event Occurs, if color is same in both the draws
# 8. df.1$Rel.Freq = with(df.1, cumsum(E.occur)/(E.Occur)) # Relative Frequency. This line does NOT work; the denominator part needs fixing
##

Problem is with #8, specifically the part:
cumsum(E.occur)/(E.Occur)

The denominator E.Occur is a fixed value, instead of a moving count. I have tried nrow() and length(), but neither provides a moving version of the row count, as cumsum does for the TRUE values occurring so far.

dput(df.1)
structure(list(Sample.Number = 1:10,
X1st.Fly = c("G", "B", "B", "G", "G", "G", "B", "G", "G", "B"),
X2nd.Fly = c("B", "B", "G", "B", "G", "B", "B", "G", "B", "B"),
Did.E.occur. = c("No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"),
Relative.Cum.Frequency.of.E = c(0, 0.5, 0.333, 0.25, 0.4, 0.333, 0.429, 0.5, 0.444, 0.5)),
.Names = c("Sample.Number", "X1st.Fly", "X2nd.Fly", "Did.E.occur.", "Relative.Cum.Frequency.of.E"),
class = "data.frame", row.names = c(NA, -10L))

Cheers !
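The "moving count" asked for in the denominator is just the row position, which `seq_along()` supplies. As a small worked check, the sketch below recomputes the last column of the posted table from the Did.E.occur. values; the vector names `did_occur` and `rel_cum_freq` are made up for this illustration.

```r
# Did.E.occur. column copied from the dput() output in the post.
did_occur <- c("No", "Yes", "No", "No", "Yes",
               "No", "Yes", "Yes", "No", "Yes")

# Numerator: events seen so far. Denominator: trials seen so far (1, 2, ..., 10).
rel_cum_freq <- cumsum(did_occur == "Yes") / seq_along(did_occur)

round(rel_cum_freq, 3)
# 0.000 0.500 0.333 0.250 0.400 0.333 0.429 0.500 0.444 0.500
```

Row 5, for example, has seen 2 events in 5 trials, giving 2/5 = 0.400, which agrees with the table.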
[R] Generating Frequency Values
Hi,

My problem is as follows:

INPUT: Frequency from one column and value of Piglets from another one

OUTPUT: Repeat this Piglet value as per the Frequency, i.e.
Piglet 1, Frequency 3, implies 1,1,1
Piglet 7, Frequency 2, implies 7,7

SOLUTION: This is what I have tried so far:

1. A helper function:
dput(fn.1)
function (df.1, vt.1)
{
    i = c(1)
    for (i in seq_along(dim(df.1)[1])) {
        print(i)
        temp = rep(df.1$Piglets[i], df.1$Frequency[i])
        append(vt.1, values = temp)
    }
}

2. A dummy data frame:
dput(df.1)
structure(list(Piglets = 5:14, Frequency = c(1L, 0L, 2L, 3L, 3L, 9L, 8L, 5L, 3L, 2L)), .Names = c("Piglets", "Frequency"), class = "data.frame", row.names = c(NA, -10L))

3. A dummy vector to hold results:
dput(vt.1)
1

4. Finally the function call:
fn.1(df.1, vt.1)

5. The result is:
[1] 1

PROBLEM: The result is not a repetition of the Piglet values as per their respective frequencies.

Thanks in advance for guidance and help.

CheeRs !

p.s. I have used caps for my heading / sections, nothing else is implied by their use.
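Two small experiments may help explain why the helper function in the post above prints only `[1] 1` and returns nothing useful. This is a sketch with made-up variable names (`n_rows`, `vt`), not code from the thread.

```r
n_rows <- 10                 # what dim(df.1)[1] evaluates to for the posted data

# seq_along() counts the elements of its argument. A single number has
# length 1, so the loop body runs once, whatever the value is.
seq_along(n_rows)            # 1
seq_len(n_rows)              # 1 2 3 4 5 6 7 8 9 10

# append() returns a new vector and leaves its input untouched,
# so the result is lost unless it is assigned.
vt <- 1
append(vt, values = c(5, 7)) # returns 1 5 7, but vt is unchanged
vt_unchanged <- vt           # still just 1
vt <- append(vt, values = c(5, 7))
vt                           # 1 5 7
```

A function also returns the value of its last evaluated expression, and a `for` loop evaluates to NULL, so the accumulated vector has to be named on the last line to be returned.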
Re: [R] Generating Frequency Values
Hi,

A big thanks to everyone who replied, and special thanks to Berend for pointing out my mistakes; that will really help me in future.

Cheers !

On Tue, Nov 26, 2013 at 11:19 PM, Berend Hasselman b...@xs4all.nl wrote:

On 26-11-2013, at 15:59, Burhan ul haq ulh...@gmail.com wrote:

Hi,

My problem is as follows:

INPUT: Frequency from one column and value of Piglets from another one

OUTPUT: Repeat this Piglet value as per the Frequency, i.e.
Piglet 1, Frequency 3, implies 1,1,1
Piglet 7, Frequency 2, implies 7,7

SOLUTION: This is what I have tried so far:

1. A helper function:
dput(fn.1)
function (df.1, vt.1)
{
    i = c(1)
    for (i in seq_along(dim(df.1)[1])) {
        print(i)
        temp = rep(df.1$Piglets[i], df.1$Frequency[i])
        append(vt.1, values = temp)
    }
}

There is a lot wrong with your function.
You should assign the result of append to vt.1.
The function should return vt.1.
Use seq_len instead of seq_along.

The function should be something like this:

fn.1 <- function (df.1, vt.1)
{
    for (i in seq_len(length.out=dim(df.1)[1])) {
        print(i)
        temp = rep(df.1$Piglets[i], df.1$Frequency[i])
        vt.1 <- append(vt.1, values = temp)
    }
    vt.1
}

But Sarah's solution is the way to go.

Berend

2. A dummy data frame:
dput(df.1)
structure(list(Piglets = 5:14, Frequency = c(1L, 0L, 2L, 3L, 3L, 9L, 8L, 5L, 3L, 2L)), .Names = c("Piglets", "Frequency"), class = "data.frame", row.names = c(NA, -10L))

3. A dummy vector to hold results:
dput(vt.1)
1

4. Finally the function call:
fn.1(df.1, vt.1)

5. The result is:
[1] 1

PROBLEM: The result is not a repetition of the Piglet values as per their respective frequencies.

Thanks in advance for guidance and help.

CheeRs !

p.s. I have used caps for my heading / sections, nothing else is implied by their use.
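Sarah's reply is not reproduced in this part of the archive, so the following is only a guess at the kind of vectorised approach Berend is referring to, not a quote of her answer. It relies on the fact that `rep()` accepts a vector for `times`, one count per element; the name `expanded` is mine.

```r
# The dummy data frame from the post.
df.1 <- data.frame(
  Piglets   = 5:14,
  Frequency = c(1L, 0L, 2L, 3L, 3L, 9L, 8L, 5L, 3L, 2L)
)

# Repeat each Piglets value as many times as its Frequency says.
# A frequency of 0 simply drops that value, so no loop or
# pre-existing result vector is needed.
expanded <- rep(df.1$Piglets, times = df.1$Frequency)

length(expanded)   # 36, the sum of the Frequency column
table(expanded)    # counts per value; 6 is absent because its frequency is 0
```

Compared with the corrected loop, this does not carry the leading 1 from the dummy vector vt.1 into the result.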