Re: [R] Removing a data subset
Hi David,

You "just" need to learn how to subset your data.frame: see ?subset and ?"[", as well as a good guide to understand the subtleties!

Some graphics functions also have a built-in argument to subset within the function (e.g. the 'subset' argument of 'plot.formula'), although ggplot() doesn't seem to have one.

In any case, I would recommend you spend some time learning that aspect, as you will always need it in one situation or another.

HTH,
Ivan

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

On 29/11/2017 22:07, David Doyle wrote:
> Say I have a dataset that looks like
>
> Location  Year  GW_Elv
> MW01      1999  546.63
> MW02      1999  474.21
> MW03      1999  471.94
> MW04      1999  466.80
> MW01      2000  545.90
> MW02      2000  546.10
>
> The whole dataset is at http://doylesdartden.com/ExampleData.csv
> and I use the code below to do the graph but I want to do it without MW01.
> How can I remove MW01??
>
> I'm sure I can do it by subsetting but I cannot figure out how to do it.
>
> Thank you
> David
>
> --
>
> library(ggplot2)
>
> MyData <- read.csv("http://doylesdartden.com/ExampleData.csv", header=TRUE, sep=",")
>
> #Sets which are detections and nondetects
> MyData$Detections <- ifelse(MyData$D_GW_Elv ==1, "Detected", "NonDetect")
>
> #Removes the NAs
> MyDataWONA <- MyData[!is.na(MyData$Detections), ]
>
> #does the plot
> p <- ggplot(data = MyDataWONA, aes(x=Year, y=GW_Elv , col=Detections)) +
>   geom_point(aes(shape=Detections)) +
>
>   ##sets the colors
>   scale_colour_manual(values=c("black","red")) + #scale_y_log10() +
>
>   #location of the legend
>   theme(legend.position=c("right")) +
>
>   #sets the line color, type and size
>   geom_line(colour="black", linetype="dotted", size=0.5) +
>   ylab("Elevation Feet Mean Sea Level")
>
> ## does the graph using the Location IDs as the different Locations.
> p + facet_grid(Location ~ .)
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] 2^3 confounded factorial experiment
> On Nov 29, 2017, at 9:20 AM, Jyoti Bhogal wrote:
>
> The following R commands were written:
>> help.search("factorial")
>> data(npk)
>> npk
>> coef(npk.aov)
>
> In the output of coef command, please explain me the interpretation of
> coefficients of block1 to block 6 in this 2^3 confounded factorial experiment.

This is very much a statistics question and as such is off-topic (as it would also be on StackOverflow). R-help is for people having difficulty coding in the R language itself. Consider CrossValidated.com, but read their posting help section first, since this is a very terse question. It would be better to include the output and your best attempt at an interpretation, so people get the sense that you at least put in some individual effort.

> Thanks.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law
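For readers who want to reproduce the output being asked about: the commands in the question never create npk.aov, so coef(npk.aov) cannot run as posted. The following is only a sketch. It assumes the model is the one used in the examples of ?npk (the post does not say how npk.aov was fitted) and it assumes R's default treatment contrasts, under which the six blocks yield five block coefficients, each a difference from the reference block. In this particular design the main effects and two-factor interactions are balanced within every block, so those coefficients should coincide with differences of raw block means, which the last lines let you check.

```r
## Assumption: the model from the examples in ?npk; the original post does not define npk.aov.
## Set treatment contrasts explicitly so the meaning of the coefficients is predictable.
old <- options(contrasts = c("contr.treatment", "contr.poly"))

data(npk)                                   # ships with base R (datasets package)
npk.aov <- aov(yield ~ block + N * P * K, data = npk)

cf <- coef(npk.aov)                         # aliased N:P:K term is not reported
print(cf)

## Block coefficients: one per non-reference block
block_cf <- cf[grepl("^block", names(cf))]

## Raw block means, and their differences from the first block, for comparison
block_means <- sapply(split(npk$yield, npk$block), mean)
block_diffs <- block_means[-1] - block_means[1]
print(cbind(coefficient = unname(block_cf), mean_difference = unname(block_diffs)))

options(old)                                # restore the previous contrast settings
```

What the block differences mean for the experiment remains a statistical question, as noted in the reply above.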
Re: [R] Data cleaning & Data preparation, what do R users want?
Hi again,
Typo in the last email. Should read "about 40 standard deviations".

Jim

On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon wrote:
> Hi Robert,
> People want different levels of automation in the software they use.
> What concerns many of us is the desire for the function
> "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> Such users typically want something that justifies its use by being
> written by someone who seems to know what they're doing and by being
> used by lots of other people. One advantage of many R functions is their
> modular construction. This encourages users to at least consider the
> steps that are taken rather than just accept what comes out of that
> long tube.
>
> Take the contentious problem of outlier identification. If I just let
> the black box peel off some values, I don't know what I have lost. On
> the other hand, if I import data and examine it with a summary
> function, I may find that one woman has a height of 5.2 meters. I can
> range check by looking up the Guinness Book of Records. It's an
> outlier. I can estimate the probability of such a height. Hmm, about
> 4 standard deviations above the mean. It's an outlier. I can attempt a
> Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> has been recorded as a metric value". It's not an outlier.
>
> The more R gravitates toward "black box" functions, the more some
> users are encouraged to let them do the work. You pays your money and
> you takes your chances.
>
> Jim
>
>
> On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins wrote:
>> R has a very wide audience, clinical research, astronomy, psychology, and
>> so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This regards the process of getting the data ready for analysis and
>> reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW, I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
Re: [R] Data cleaning & Data preparation, what do R users want?
Hi Robert,
People want different levels of automation in the software they use. What concerns many of us is the desire for the function "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values". Such users typically want something that justifies its use by being written by someone who seems to know what they're doing and by being used by lots of other people. One advantage of many R functions is their modular construction. This encourages users to at least consider the steps that are taken rather than just accept what comes out of that long tube.

Take the contentious problem of outlier identification. If I just let the black box peel off some values, I don't know what I have lost. On the other hand, if I import data and examine it with a summary function, I may find that one woman has a height of 5.2 meters. I can range check by looking up the Guinness Book of Records. It's an outlier. I can estimate the probability of such a height. Hmm, about 4 standard deviations above the mean. It's an outlier. I can attempt a Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2") has been recorded as a metric value". It's not an outlier.

The more R gravitates toward "black box" functions, the more some users are encouraged to let them do the work. You pays your money and you takes your chances.

Jim

On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins wrote:
> R has a very wide audience, clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
>
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
>
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
>
> [BTW, I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
>
> Robert Wilkins
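The three checks described above (range check, distance from the mean, and the "Sherlock Holmes" unit guess) can be sketched in a few lines of base R. Everything numeric here is an illustrative assumption, not data from the thread: the sample heights, the 2.5 m plausibility bound, and the reference mean and standard deviation (1.62 m and 0.09 m, with which 5.2 m comes out near 40 standard deviations). The helper feet_inches_to_m() is a hypothetical name introduced for the example.

```r
## Illustrative values only; none of these numbers come from the thread.
heights_m <- c(1.58, 1.63, 1.71, 5.2, 1.66, 1.60)

## Step 1: look at the data before touching it
print(summary(heights_m))

## Step 2: range check against an assumed upper bound for human height
max_plausible_m <- 2.5
suspect <- heights_m > max_plausible_m

## Step 3: how far is the suspect value from an assumed reference population?
ref_mean_m <- 1.62
ref_sd_m   <- 0.09
z <- (heights_m - ref_mean_m) / ref_sd_m
print(round(z[suspect], 1))

## Step 4: the "Sherlock Holmes" hypothesis, reading 5.2 as 5 ft 2 in
feet_inches_to_m <- function(feet, inches) feet * 0.3048 + inches * 0.0254
feet   <- floor(heights_m[suspect])
inches <- round((heights_m[suspect] - feet) * 10)

## Keep the original vector; write the reinterpreted value to a copy
repaired <- heights_m
repaired[suspect] <- feet_inches_to_m(feet, inches)
print(repaired)
```

Keeping heights_m untouched and writing the repaired value to a copy leaves a record of what was changed, which is the opposite of the black box peeling off values.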
Re: [R] dplyr - add/expand rows
On 11/29/2017 05:47 PM, Tóth Dénes wrote:
> Hi Martin,
>
> On 11/29/2017 10:46 PM, Martin Morgan wrote:
>> On 11/29/2017 04:15 PM, Tóth Dénes wrote:
>>> Hi,
>>>
>>> A benchmarking study with an additional (data.table-based) solution.
>>
>> I don't think speed is the right benchmark (I do agree that correctness is!).
>
> Well, agree, and sorry for the wording. It was really just an exercise and not a full evaluation of the approaches. When I read the avalanche of solutions, none of which mentioned data.table (my first choice for data.frame manipulations), I became curious how a one-liner data.table solution performs against the other solutions in terms of speed and readability.
>
> Second, I quite often have the feeling that dplyr is extremely overused among novice (and sometimes even experienced) R users nowadays. This is unfortunate, as the present example also illustrates.

Another solution is Bill's approach with dplyr's implementation (adding the 1L to keep integers integers!)

  fun_bill1 <- function(d) {
      i <- rep(seq_len(nrow(d)), d$to - d$from + 1L)
      j <- sequence(d$to - d$from + 1L)
      ## d[i,] %>% mutate(year = from + j - 1L, from = NULL, to = NULL)
      mutate(d[i,], year = from + j - 1L, from = NULL, to = NULL)
  }

which is competitive with IRanges and data.table (the more dplyr-ish? solution

  d[i, ] %>% mutate(year = from + j - 1L) %>%
      select(station, record, year))

has intermediate performance) and might appeal to those introduced to R through dplyr but wanting more base R knowledge, and vice versa. I think if dplyr introduces new users to R, or exposes R users to new approaches for working with data, that's great!

Martin

> Regards,
> Denes
>
>> For the R-help list, maybe something about least specialized R knowledge required would be appropriate?
Re: [R] dplyr - add/expand rows
Hi Martin,

On 11/29/2017 10:46 PM, Martin Morgan wrote:
> On 11/29/2017 04:15 PM, Tóth Dénes wrote:
>> Hi,
>>
>> A benchmarking study with an additional (data.table-based) solution.
>
> I don't think speed is the right benchmark (I do agree that correctness is!).

Well, agree, and sorry for the wording. It was really just an exercise and not a full evaluation of the approaches. When I read the avalanche of solutions, none of which mentioned data.table (my first choice for data.frame manipulations), I became curious how a one-liner data.table solution performs against the other solutions in terms of speed and readability.

Second, I quite often have the feeling that dplyr is extremely overused among novice (and sometimes even experienced) R users nowadays. This is unfortunate, as the present example also illustrates.

Regards,
Denes

> For the R-help list, maybe something about least specialized R knowledge required would be appropriate?
>
> I'd say there were some 'hard' solutions -- Michael (deep understanding of Bioconductor and IRanges), Toth (deep understanding of data.table), Jim (at least for me moderate understanding of dplyr, especially the .$ notation; a simpler dplyr answer might have moved this response out of the 'difficult' category, especially given the familiarity of the OP with dplyr). I'd vote for Bill's as requiring the least specialized knowledge of R (though the +/- 1 indexing is an easy thing to get wrong).
>
> A different criterion might be reuse across analysis scenarios. Bill seems to win here again, since the principles are very general and at least moderately efficient (both Bert and Martin's solutions are essentially R-level iterations and have poor scalability, as demonstrated in the microbenchmarks; Bill's is mostly vectorized).
>
> Certainly data.table, dplyr, and IRanges are extremely useful within the confines of the problem domains they address.
>
> Martin
>
>> Enjoy!
Re: [R] Removing a data subset
Reading in the data from the file:

x <- read.csv( "ExampleData.csv", header = TRUE, stringsAsFactors = FALSE )

Subsetting as you want:

x <- x[ x$Location != "MW01", ]

This selects all rows where the value in column 'Location' is not equal to "MW01". The comma after the condition, with nothing following it, means that all columns are kept in the amended data.frame.

Rgds,
Rainer

On Wednesday, 29 November 2017 15:07:34 +08 David Doyle wrote:
> Say I have a dataset that looks like
>
> Location  Year  GW_Elv
> MW01      1999  546.63
> MW02      1999  474.21
> MW03      1999  471.94
> MW04      1999  466.80
> MW01      2000  545.90
> MW02      2000  546.10
>
> The whole dataset is at http://doylesdartden.com/ExampleData.csv
> and I use the code below to do the graph but I want to do it without MW01.
> How can I remove MW01??
>
> I'm sure I can do it by subsetting but I cannot figure out how to do it.
>
> Thank you
> David
>
> --
>
> library(ggplot2)
>
> MyData <- read.csv("http://doylesdartden.com/ExampleData.csv", header=TRUE,
> sep=",")
>
> #Sets which are detections and nondetects
> MyData$Detections <- ifelse(MyData$D_GW_Elv ==1, "Detected", "NonDetect")
>
> #Removes the NAs
> MyDataWONA <- MyData[!is.na(MyData$Detections), ]
>
> #does the plot
> p <- ggplot(data = MyDataWONA, aes(x=Year, y=GW_Elv , col=Detections)) +
>   geom_point(aes(shape=Detections)) +
>
>   ##sets the colors
>   scale_colour_manual(values=c("black","red")) + #scale_y_log10() +
>
>   #location of the legend
>   theme(legend.position=c("right")) +
>
>   #sets the line color, type and size
>   geom_line(colour="black", linetype="dotted", size=0.5) +
>   ylab("Elevation Feet Mean Sea Level")
>
> ## does the graph using the Location IDs as the different Locations.
> p + facet_grid(Location ~ .)
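A small self-contained sketch of the subsetting idioms discussed in this thread, using a toy data frame built from the six rows shown in the question rather than the full CSV (the extra NA row further down is an invented illustration). It compares "[" with subset(), shows %in% for dropping several locations at once, and demonstrates one of the subtleties worth knowing: a missing Location behaves differently under the two approaches.

```r
## Toy data modelled on the rows shown in the post (not the full CSV).
MyData <- data.frame(
  Location = c("MW01", "MW02", "MW03", "MW04", "MW01", "MW02"),
  Year     = c(1999, 1999, 1999, 1999, 2000, 2000),
  GW_Elv   = c(546.63, 474.21, 471.94, 466.80, 545.90, 546.10),
  stringsAsFactors = FALSE
)

## Logical row index; the empty slot after the comma keeps every column
no_mw01_a <- MyData[MyData$Location != "MW01", ]

## Same result with subset(), which evaluates the condition inside the data frame
no_mw01_b <- subset(MyData, Location != "MW01")

## Dropping several locations at once
no_two <- MyData[!(MyData$Location %in% c("MW01", "MW04")), ]

## Subtlety: with a missing Location (an invented row for illustration),
## "[" returns an extra all-NA row, while subset() silently drops it
with_na <- rbind(MyData, data.frame(Location = NA, Year = 2000, GW_Elv = 470.00))
n_bracket <- nrow(with_na[with_na$Location != "MW01", ])
n_subset  <- nrow(subset(with_na, Location != "MW01"))
print(c(bracket = n_bracket, subset = n_subset))
```

In the plotting code from the question, the same filter would be applied to MyDataWONA before it is handed to ggplot(), for example MyDataWONA[MyDataWONA$Location != "MW01", ].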
Re: [R] dplyr - add/expand rows
On 11/29/2017 04:15 PM, Tóth Dénes wrote:
> Hi,
>
> A benchmarking study with an additional (data.table-based) solution.

I don't think speed is the right benchmark (I do agree that correctness is!).

For the R-help list, maybe something about least specialized R knowledge required would be appropriate?

I'd say there were some 'hard' solutions -- Michael (deep understanding of Bioconductor and IRanges), Toth (deep understanding of data.table), Jim (at least for me moderate understanding of dplyr, especially the .$ notation; a simpler dplyr answer might have moved this response out of the 'difficult' category, especially given the familiarity of the OP with dplyr). I'd vote for Bill's as requiring the least specialized knowledge of R (though the +/- 1 indexing is an easy thing to get wrong).

A different criterion might be reuse across analysis scenarios. Bill seems to win here again, since the principles are very general and at least moderately efficient (both Bert and Martin's solutions are essentially R-level iterations and have poor scalability, as demonstrated in the microbenchmarks; Bill's is mostly vectorized).

Certainly data.table, dplyr, and IRanges are extremely useful within the confines of the problem domains they address.

Martin

> Enjoy!
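Since the +/- 1 indexing in the rep()/sequence() approach is singled out above as easy to get wrong, here is a small worked sketch of that idea in base R only. It reuses the five-row input from Jim's tribble elsewhere in this thread; the variable names n_years, i and j are chosen for the example, and the output is built with data.frame() rather than transform() purely to keep each step visible.

```r
## The five-row example input from the thread, as a plain data.frame
input <- data.frame(
  station = "07EA001",
  from    = c(1960L, 1961L, 1971L, 1972L, 1977L),
  to      = c(1960L, 1970L, 1971L, 1976L, 1983L),
  record  = c("QMS", "QMC", "QMM", "QMC", "QRC"),
  stringsAsFactors = FALSE
)

## Years covered by each row; both endpoints are included, hence the + 1
n_years <- input$to - input$from + 1L

## i: which input row each output row is copied from
## j: position within that row's run (1, 2, ..., n_years)
i <- rep(seq_len(nrow(input)), n_years)
j <- sequence(n_years)

result <- data.frame(
  station = input$station[i],
  year    = input$from[i] + j - 1L,   # j starts at 1, so subtract 1 to start at 'from'
  record  = input$record[i],
  stringsAsFactors = FALSE
)

print(head(result))
print(nrow(result))
```

Using 1L rather than 1 keeps year an integer, the same point made about fun_bill1 in this thread.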
Re: [R] dplyr - add/expand rows
Hi,

A benchmarking study with an additional (data.table-based) solution. Enjoy! ;)

Cheers,
Denes

--

## packages ##

library(dplyr)
library(data.table)
library(IRanges)
library(microbenchmark)

## prepare example dataset ###

## use Bert's example, with 2000 stations instead of 2
d_df <- data.frame(
  station = rep(rep(c("one","two"),c(5,4)), 1000L),
  from = as.integer(c(60,61,71,72,76,60,65,82,83)),
  to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
  record = c("A","B","C","B","D","B","B","D","E"),
  stringsAsFactors = FALSE)
stations <- rle(d_df$station)
stations$value <- gsub(
  " ", "0",
  paste0("station", format(1:length(stations$value), width = 6)))
d_df$station <- rep(stations$value, stations$lengths)

## prepare tibble and data.table versions
d_tbl <- as_tibble(d_df)
d_dt <- as.data.table(d_df)

## solutions ##

## Bert - by
fun_bert <- function(d) {
  out <- by(
    d, d$station, function(x) with(x, {
      i <- to - from +1
      data.frame(record =rep(record,i),
                 year =sequence(i) -1 + rep(from,i),
                 stringsAsFactors = FALSE)
    }))
  data.frame(station = rep(names(out), sapply(out,nrow)),
             do.call(rbind,out),
             row.names = NULL,
             stringsAsFactors = FALSE)
}

## Bill - transform
fun_bill <- function(d) {
  i <- rep(seq_len(nrow(d)), d$to-d$from+1)
  j <- sequence(d$to-d$from+1)
  transform(d[i,], year=from+j-1, from=NULL, to=NULL)
}

## Michael - IRanges
fun_michael <- function(d) {
  df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
  expand(df, "year")
}

## Jim - dplyr
fun_jim <- function(d) {
  d %>%
    rowwise() %>%
    do(tibble(station = .$station,
              record = .$record,
              year = seq(.$from, .$to))
    )
}

## Martin - Map
fun_martin <- function(d) {
  d$year <- with(d, Map(seq, from, to))
  res0 <- with(d, Map(data.frame,
                      station=station,
                      record=record,
                      year=year,
                      MoreArgs = list(stringsAsFactors = FALSE)))
  do.call(rbind, unname(res0))
}

## Denes - simple data.table
fun_denes <- function(d) {
  out <- d[, .(year = from:to), by = .(station, from, record)]
  out[, from := NULL]
}

## Check equality
all.equal(fun_bill(d_df), fun_bert(d_df),
          check.attributes = FALSE)
all.equal(fun_bill(d_df), fun_martin(d_df),
          check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
          check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
          check.attributes = FALSE)
# Be prepared: this solution is super slow
all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
          check.attributes = FALSE)

## Benchmark #

## Martin
print(system.time(fun_martin(d_df)))

## Bert
print(system.time(fun_bert(d_df)))

## Top 3
print(
  microbenchmark(
    fun_bill(d_df),
    fun_michael(d_df),
    fun_denes(d_dt),
    times = 100L
  )
)

-

On 11/28/2017 06:49 PM, Michael Lawrence wrote:
> Or with the Bioconductor IRanges package:
>
> df <- with(input, DataFrame(station, year=IRanges(from, to), record))
> expand(df, "year")
>
> DataFrame with 24 rows and 3 columns
>      station  year  record
> 1    07EA001  1960  QMS
> 2    07EA001  1961  QMC
> 3    07EA001  1962  QMC
> 4    07EA001  1963  QMC
> 5    07EA001  1964  QMC
> ...  ...      ...   ...
> 20   07EA001  1979  QRC
> 21   07EA001  1980  QRC
> 22   07EA001  1981  QRC
> 23   07EA001  1982  QRC
> 24   07EA001  1983  QRC
>
> If you tell the computer more about your data, it can do more things for you.
>
> Michael
>
> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan < martin.mor...@roswellpark.org> wrote:
>> On 11/26/2017 08:42 PM, jim holtman wrote:
>>> try this:
>>>
>>> ##
>>>
>>> library(dplyr)
>>>
>>> input <- tribble(
>>>   ~station, ~from, ~to, ~record,
>>>   "07EA001" ,1960 , 1960 , "QMS",
>>>   "07EA001" , 1961 , 1970 , "QMC",
>>>   "07EA001" ,1971 , 1971 , "QMM",
>>>   "07EA001" ,1972 , 1976 , "QMC",
>>>   "07EA001" ,1977 , 1983 , "QRC"
>>> )
>>>
>>> result <- input %>%
>>>   rowwise() %>%
>>>   do(tibble(station = .$station,
>>>             year = seq(.$from, .$to),
>>>             record = .$record)
>>>   )
>>>
>>> ###
>>
>> In a bit more 'base R' mode I did
>>
>>   input$year <- with(input, Map(seq, from, to))
>>   res0 <- with(input, Map(data.frame, station=station, year=year, record=record))
>>   as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>
>> resulting in
>>
>>   as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>     station year record
>>   1
[R] Removing a data subset
Say I have a dataset that looks like

Location  Year  GW_Elv
MW01      1999  546.63
MW02      1999  474.21
MW03      1999  471.94
MW04      1999  466.80
MW01      2000  545.90
MW02      2000  546.10

The whole dataset is at http://doylesdartden.com/ExampleData.csv
and I use the code below to do the graph but I want to do it without MW01.
How can I remove MW01??

I'm sure I can do it by subsetting but I cannot figure out how to do it.

Thank you
David

--

library(ggplot2)

MyData <- read.csv("http://doylesdartden.com/ExampleData.csv", header=TRUE, sep=",")

#Sets which are detections and nondetects
MyData$Detections <- ifelse(MyData$D_GW_Elv ==1, "Detected", "NonDetect")

#Removes the NAs
MyDataWONA <- MyData[!is.na(MyData$Detections), ]

#does the plot
p <- ggplot(data = MyDataWONA, aes(x=Year, y=GW_Elv , col=Detections)) +
  geom_point(aes(shape=Detections)) +

  ##sets the colors
  scale_colour_manual(values=c("black","red")) + #scale_y_log10() +

  #location of the legend
  theme(legend.position=c("right")) +

  #sets the line color, type and size
  geom_line(colour="black", linetype="dotted", size=0.5) +
  ylab("Elevation Feet Mean Sea Level")

## does the graph using the Location IDs as the different Locations.
p + facet_grid(Location ~ .)
[R] 2^3 confounded factorial experiment
The following R commands were written:

> help.search("factorial")
> data(npk)
> npk
> coef(npk.aov)

In the output of coef command, please explain me the interpretation of coefficients of block1 to block 6 in this 2^3 confounded factorial experiment.

Thanks.
Re: [R] SAMseq errors
A) This list is a general-interest list on the R language. You have posed your question as if you were looking for domain experts, whom you are more likely to find on the Bioconductor mailing list.

B) Your example is not reproducible. [1][2][3]

C) Just because your data don't have missing values does not mean that your early analysis steps don't create them, e.g. by taking the logarithm of negative numbers. Look at the intermediate values in your analysis, and read the documentation for the steps you are treating as "magic black boxes".

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
[2] http://adv-r.had.co.nz/Reproducibility.html
[3] https://cran.r-project.org/web/packages/reprex/index.html (read the vignette)

--
Sent from my phone. Please excuse my brevity.

On November 29, 2017 9:39:24 AM PST, array chip via R-help wrote:
> Sorry, forgot to use plain text format; hope this time it works:
>
> Hi, I am trying to use SAMseq() to analyze my RNA-seq experiment
> (2 genes x 550 samples) with survival endpoint. It quickly gives the
> following error:
>
> > library(samr)
> Loading required package: impute
> Loading required package: matrixStats
>
> Attaching package: ‘matrixStats’
>
> The following objects are masked from ‘package:Biobase’:
>
>     anyMissing, rowMedians
>
> Warning messages:
> 1: package ‘samr’ was built under R version 3.3.3
> 2: package ‘matrixStats’ was built under R version 3.3.3
>
> > samfit<-SAMseq(data, PFI.time,censoring.status=PFI.status, resp.type="Survival")
>
> Estimating sequencing depths...
> Error in quantile.default(prop, c(0.25, 0.75)) :
>   missing values and NaN's not allowed if 'na.rm' is FALSE
> In addition: Warning message:
> In sum(x) : integer overflow - use sum(as.numeric(.))
> Error during wrapup: cannot open the connection
>
> > sessionInfo()
> R version 3.3.2 (2016-10-31)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 7 x64 (build 7601) Service Pack 1
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices datasets utils methods base
>
> other attached packages:
> [1] samr_2.0 matrixStats_0.52.2 impute_1.48.0 BiocInstaller_1.24.0 rcom_3.1-3 rscproxy_2.1-1
>
> loaded via a namespace (and not attached):
> [1] tools_3.3.2
>
> I checked, my data matrix and y variables have no missing values.
> Does anyone have suggestions as to what's going on?
>
> Thank you!
>
> John
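Point C can be illustrated with a minimal sketch on made-up numbers (not the poster's data). The vectors `counts` and `values` are assumptions chosen only to trigger the two messages seen in the quoted output; nothing here claims to show what SAMseq() does internally.

```r
# Integer sums can overflow: the result is NA, with the warning
# "integer overflow - use sum(as.numeric(.))".
counts <- c(.Machine$integer.max, 1L)       # no NA in the input
total_int <- suppressWarnings(sum(counts))  # NA because of the overflow
total_num <- sum(as.numeric(counts))        # summing doubles avoids it

# Taking the log of a negative number creates NaN (with a warning) ...
values <- c(4, 1, -2)
logged <- suppressWarnings(log(values))

# ... and quantile() then stops unless na.rm = TRUE is given.
res <- tryCatch(quantile(logged, c(0.25, 0.75)),
                error = function(e) conditionMessage(e))

anyNA(counts)  # the raw input is complete
anyNA(logged)  # the intermediate value is not
```

Checking `anyNA()` on intermediate objects like this is one way to find the step at which missing values first appear.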
[R] SAMseq errors
Sorry, forgot to use plain text format; hope this time it works:

Hi, I am trying to use SAMseq() to analyze my RNA-seq experiment (2 genes x 550 samples) with survival endpoint. It quickly gives the following error:

> library(samr)
Loading required package: impute
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

Warning messages:
1: package ‘samr’ was built under R version 3.3.3
2: package ‘matrixStats’ was built under R version 3.3.3

> samfit<-SAMseq(data, PFI.time,censoring.status=PFI.status, resp.type="Survival")

Estimating sequencing depths...
Error in quantile.default(prop, c(0.25, 0.75)) :
  missing values and NaN's not allowed if 'na.rm' is FALSE
In addition: Warning message:
In sum(x) : integer overflow - use sum(as.numeric(.))
Error during wrapup: cannot open the connection

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] samr_2.0 matrixStats_0.52.2 impute_1.48.0 BiocInstaller_1.24.0 rcom_3.1-3 rscproxy_2.1-1

loaded via a namespace (and not attached):
[1] tools_3.3.2

I checked, my data matrix and y variables have no missing values. Does anyone have suggestions as to what's going on?

Thank you!

John
Re: [R] Data cleaning & Data preparation, what do R users want?
Christopher,

OK, well, what about a range of functions in an R package that automatically, with very little syntax, pull in data from a variety of formats (CSV, SQLite, and so on) and convert them to an R data frame? You seem to be pointing to something like that. Something like it, in some form or another, probably already exists, though it might be imperfect (not as user-friendly as possible), not well publicised, or both.

Or another tangent: your co-workers are not going to stop using Excel, whether you like it or not, and many end-users are stuck in the exact same position as you (co-workers who deliver the data in Excel). I will guess that data stored in Excel tends to be dirty in somewhat predictable ways. (And again, those other end-users' co-workers are not going to change their behaviour.) And so: a data munging tool that makes it as easy as possible to clean up the data in Excel spreadsheets and export them to R data frames. One prerequisite: an understanding of what tends to go wrong with data in Excel (the data in Excel tends to be dirty, but dirty in what way?).

Thank you for your response, Christopher. What state are you in?

On Wed, Nov 29, 2017 at 11:52 AM, Christopher W. Ryan wrote:
> Great question. What do I want? I want my co-workers to stop using Excel
> spreadsheets for data entry, storage, and sharing! I want them to
> understand the value of data discipline. But alas . . . .
>
> I work in a county health department in the US. Between dplyr, stringr,
> grep, grepl, and the base R read.*() functions, I'm doing OK.
>
> I need to learn more about APIs, so I can see if I can make R directly
> grab data from, e.g., our state health department sources. My biggest
> hassle is having to download a data file, save it somewhere, and then
> open R and read it in. I'd like to be able to do it all in R. It would make
> the generation of recurring reports easier.
>
> --Chris Ryan
Re: [R] Data cleaning & Data preparation, what do R users want?
Oh Crap! I mistakenly replied onlist. PLEASE IGNORE -- these are only my ignorant opinions.

-- Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Wed, Nov 29, 2017 at 8:48 AM, Bert Gunter wrote:
> I don't think my view is of interest to many, so offlist.
>
> I reject this:
>
> "I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report."
>
> For example, there is no such thing as "outliers" -- data to be removed as
> part of cleaning/preparation -- without a statistical model to be an
> "outlier" **from**, which is part of the statistical analysis. And the
> structure of the data (data preparation) may need to change depending on
> the course of the analysis (including graphics, also part of the analysis).
> So I think your view reflects a naïve view of the nature of data analysis,
> which is an iterative and holistic process. I suspect your training is as a
> computer scientist and you have not done much 1-1 consulting with
> researchers, though you should certainly feel free to reject this canard.
> Building software for large-scale automated analysis of data requires a
> much different analytical paradigm than the statistical consulting model,
> which is largely my background.
>
> No reply necessary. Just my opinion, which you are of course free to trash.
>
> Cheers,
> Bert
Re: [R] Data cleaning & Data preparation, what do R users want?
Great question. What do I want? I want my co-workers to stop using Excel spreadsheets for data entry, storage, and sharing! I want them to understand the value of data discipline. But alas . . . .

I work in a county health department in the US. Between dplyr, stringr, grep, grepl, and the base R read.*() functions, I'm doing OK.

I need to learn more about APIs, so I can see if I can make R directly grab data from, e.g., our state health department sources. My biggest hassle is having to download a data file, save it somewhere, and then open R and read it in. I'd like to be able to do it all in R. It would make the generation of recurring reports easier.

--Chris Ryan

Robert Wilkins wrote:
> R has a very wide audience: clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
>
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
>
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
>
> [BTW, I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
>
> Robert Wilkins
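On the "download, save, then read" hassle: read.csv() accepts a URL as well as a file name, so for sources that publish plain CSV files the manual step can often be skipped. The sketch below is only an illustration under stated assumptions: the file contents and column names are made up, and a local file:// URL stands in for a real http(s) address so that the example runs without network access. It does not cover sources that require an authenticated API.

```r
# Write a small CSV to a temporary file to stand in for a remote source.
tmp <- tempfile(fileext = ".csv")
writeLines(c("county,cases", "A,10", "B,7"), tmp)

# Build a file:// URL for it (Windows paths need an extra leading slash).
path <- normalizePath(tmp, winslash = "/")
src <- paste0("file://", if (.Platform$OS.type == "windows") "/" else "", path)

# read.csv() takes the URL directly; no separate download step is needed.
dat <- read.csv(src, stringsAsFactors = FALSE)
```

In a recurring report, `src` would be replaced by the address of the published file, and the rest of the script would stay the same.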
Re: [R] Data cleaning & Data preparation, what do R users want?
I don't think my view is of interest to many, so offlist.

I reject this:

"I would consider data analysis work to be three stages: data preparation, statistical analysis, and producing the report."

For example, there is no such thing as "outliers" -- data to be removed as part of cleaning/preparation -- without a statistical model to be an "outlier" **from**, which is part of the statistical analysis. And the structure of the data (data preparation) may need to change depending on the course of the analysis (including graphics, also part of the analysis). So I think your view reflects a naïve view of the nature of data analysis, which is an iterative and holistic process. I suspect your training is as a computer scientist and you have not done much 1-1 consulting with researchers, though you should certainly feel free to reject this canard. Building software for large-scale automated analysis of data requires a much different analytical paradigm than the statistical consulting model, which is largely my background.

No reply necessary. Just my opinion, which you are of course free to trash.

Cheers,
Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Wed, Nov 29, 2017 at 8:37 AM, Robert Wilkins wrote:
> R has a very wide audience: clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
>
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
>
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
>
> [BTW, I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
>
> Robert Wilkins
[R] Data cleaning & Data preparation, what do R users want?
R has a very wide audience: clinical research, astronomy, psychology, and so on and so on.

I would consider data analysis work to be three stages: data preparation, statistical analysis, and producing the report. This regards the process of getting the data ready for analysis and reporting, sometimes called "data cleaning" or "data munging" or "data wrangling".

So as regards tools for data preparation, speaking to the highly diverse audience mentioned, here is my question:

What do you want? Or are you already quite happy with the range of tools that is currently before you?

[BTW, I posed the same question last week to the r-devel list, and was advised that r-help might be a more suitable audience by one of the moderators.]

Robert Wilkins
Re: [R] Preventing repeated package installation, or pre installing packages
Dear Larry,

Have a look at https://github.com/inbo/rstable. It is a Dockerfile with a stable version of R and a set of packages.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkel...@inbo.be
Kliniekstraat 25, B-1070 Brussel
www.inbo.be

///
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
///
From 14 through 19 December 2017 we are moving from our Brussels office to the Herman Teirlinck building at the Thurn & Taxis site. From then on you are welcome at the new address: Havenlaan 88 bus 73, 1000 Brussel.
///

2017-11-29 15:28 GMT+01:00 Larry Martell:
> I have an R script that I call from Python using rpy2. It uses dplyr, doBy,
> and ggplot2. The script has install.packages commands for these 3 packages.
> Even though the packages are already installed, it still downloads,
> builds, and installs them, which is very time consuming. Is there a way to
> have it only do the install if the package is not already installed?
>
> Also, I run in a Docker container, so after the container is instantiated
> the packages are not there the first time the script runs. Is there a way
> to preload the packages, in which case I would not need the
> install.packages commands for these packages and my above question would
> become moot?
Re: [R] Preventing repeated package installation, or pre installing packages
> On 29 Nov 2017, at 15:28, Larry Martell wrote:
>
> I have an R script that I call from Python using rpy2. It uses dplyr, doBy,
> and ggplot2. The script has install.packages commands for these 3 packages.
> Even though the packages are already installed, it still downloads,
> builds, and installs them, which is very time consuming. Is there a way to
> have it only do the install if the package is not already installed?

You could use something like

if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

where require() returns FALSE if it fails to load the package.

> Also, I run in a Docker container, so after the container is instantiated
> the packages are not there the first time the script runs. Is there a way
> to preload the packages, in which case I would not need the
> install.packages commands for these packages and my above question would
> become moot?

Yes -- add them to your Dockerfile, but this is a Docker question, not an R one. Check out the Rocker Dockerfiles to see how you can do this.

Rainer

--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)
University of Zürich
Cell: +41 (0)78 630 66 57
email: rai...@krugs.de
Skype: RMkrug
PGP: 0x0F52F982
Re: [R] Preventing repeated package installation, or pre installing packages
Dear Larry,

As far as your first question is concerned, I think one of require or requireNamespace may be what you need.

Michael

On 29/11/2017 14:28, Larry Martell wrote:
> I have an R script that I call from Python using rpy2. It uses dplyr, doBy,
> and ggplot2. The script has install.packages commands for these 3 packages.
> Even though the packages are already installed, it still downloads,
> builds, and installs them, which is very time consuming. Is there a way to
> have it only do the install if the package is not already installed?
>
> Also, I run in a Docker container, so after the container is instantiated
> the packages are not there the first time the script runs. Is there a way
> to preload the packages, in which case I would not need the
> install.packages commands for these packages and my above question would
> become moot?

--
Michael
http://www.dewey.myzen.co.uk/home.html
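The require/requireNamespace suggestion can be sketched for several packages at once. The helper names `missing_packages` and `ensure_packages`, and the `installer` argument, are hypothetical and introduced only for this illustration; requireNamespace() and install.packages() are the real functions doing the work.

```r
# Return the packages in `pkgs` whose namespace cannot be loaded.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!found])
}

# Install only what is absent. `installer` defaults to install.packages()
# and is a parameter only so the behaviour can be checked without a download.
ensure_packages <- function(pkgs, installer = utils::install.packages) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) installer(to_install)
  invisible(to_install)
}

# In the script from the question this would be:
#   ensure_packages(c("dplyr", "doBy", "ggplot2"))
# Both packages below ship with R, so nothing is installed here.
ensure_packages(c("stats", "utils"))
```

For the Docker case, the same install.packages() call would typically go into a build step of the image, so that it runs once when the image is built rather than each time the script starts.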
[R] Preventing repeated package installation, or pre installing packages
I have an R script that I call from Python using rpy2. It uses dplyr, doBy, and ggplot2. The script has install.packages commands for these 3 packages. Even though the packages are already installed, it still downloads, builds, and installs them, which is very time consuming. Is there a way to have it only do the install if the package is not already installed?

Also, I run in a Docker container, so after the container is instantiated the packages are not there the first time the script runs. Is there a way to preload the packages, in which case I would not need the install.packages commands for these packages and my above question would become moot?
Re: [R] DeSolve Package and Moving Average
Since you only provide pseudo-code, I will give a guess as to the source of the problem. It is easy to get "burned" by use of the ifelse() function. Its result has the same "shape" as the first argument, so with a length-one test it returns a single value even when one of the branches is a vector. My suggestion is to try replacing ifelse by a standard if ( ) { } else { }.

HTH,
Eric

On Wed, Nov 29, 2017 at 1:29 PM, Werning, Jan-Philipp <jan-philipp.wern...@whu.edu> wrote:
> Dear all,
>
> I am using the deSolve package to simulate a system dynamics model. At the
> problematic point in the model, I basically want to decide how many
> products shall be produced to be sold. In order to determine the amount, a
> basic forecasting model using the average of the last 12 time periods
> shall be used. My code looks like the following.
>
> […]
>
> # Time units in month
> START<-0; FINISH<-120; STEP<-1
>
> # Set seed for reproducibility
> set.seed(123)
>
> # Create time vector
> simtime <- seq(START, FINISH, by=STEP)
>
> # Create a stock vector with initial values
> stocks <- c([…])
>
> # Create an aux vector for the fixed aux values
> auxs <- c([…])
>
> model <- function(time, stocks, auxs){
>   with(as.list(c(stocks, auxs)),{
>
>     [… "lots of aux, flow, and stock functions" …]
>
>     aMovingAverage <- ifelse(!exists("ResultsSimulation"), 1, movavg(ResultsSimulation$TotalSales, 12, type = "s"))
>
>     return(list(c([…])))
>   })
> }
>
> # Call Solver, and store results in a data frame
> ResultsSimulation <- data.frame(ode(y=stocks, times=simtime, func = model, parms=auxs, method="euler"))
>
> […]
>
> My problem is that the moving average (function: movavg) is only computed
> once and the same value is used in every timestep of the model. I.e., when
> running the model for the first time, 1 is used; running it the
> next time, the total sales value of the first timestep is used. Since only
> one timestep exists, this is logical. Yet I would expect the movavg
> function to produce a new value in each of the 120 timesteps, as is the
> case with all other flow, stock and aux calculations as well.
>
> It would be great if you could help me with fixing this problem.
>
> Many thanks in advance!
>
> Yours,
>
> Jan
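The "shape" behaviour of ifelse() mentioned above can be seen in a minimal sketch. The vector `history` is made up, and cumsum() stands in for movavg() only because it also returns a vector and needs no extra package.

```r
history <- c(10, 12, 14, 16)  # made-up sales history

# ifelse() shapes its result like the test, here a single TRUE/FALSE,
# so only the first element of the vector branch survives.
from_ifelse <- ifelse(length(history) == 0, 1, cumsum(history))

# if/else returns whichever branch is chosen, whole.
from_if <- if (length(history) == 0) 1 else cumsum(history)

length(from_ifelse)  # one value
length(from_if)      # one value per element of history
```

This only illustrates the shape issue. It may not be the whole story for the posted model: ResultsSimulation is assigned only after ode() has returned, so inside the model function it cannot contain results from the run that is still in progress.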
[R] DeSolve Package and Moving Average
Dear all,

I am using the deSolve package to simulate a system dynamics model. At the problematic point in the model, I basically want to decide how many products shall be produced to be sold. In order to determine the amount, a basic forecasting model using the average of the last 12 time periods shall be used. My code looks like the following.

[…]

# Time units in month
START<-0; FINISH<-120; STEP<-1

# Set seed for reproducibility
set.seed(123)

# Create time vector
simtime <- seq(START, FINISH, by=STEP)

# Create a stock vector with initial values
stocks <- c([…])

# Create an aux vector for the fixed aux values
auxs <- c([…])

model <- function(time, stocks, auxs){
  with(as.list(c(stocks, auxs)),{

    [… "lots of aux, flow, and stock functions" …]

    aMovingAverage <- ifelse(!exists("ResultsSimulation"), 1, movavg(ResultsSimulation$TotalSales, 12, type = "s"))

    return(list(c([…])))
  })
}

# Call Solver, and store results in a data frame
ResultsSimulation <- data.frame(ode(y=stocks, times=simtime, func = model, parms=auxs, method="euler"))

[…]

My problem is that the moving average (function: movavg) is only computed once and the same value is used in every timestep of the model. I.e., when running the model for the first time, 1 is used; running it the next time, the total sales value of the first timestep is used. Since only one timestep exists, this is logical. Yet I would expect the movavg function to produce a new value in each of the 120 timesteps, as is the case with all other flow, stock and aux calculations as well.

It would be great if you could help me with fixing this problem.

Many thanks in advance!

Yours,

Jan
Re: [R-es] Búsqueda de palabras en una variable de R
readLines()

On Wed, 29 Nov 2017, 5:51, wrote:
> Thank you very much,
>
> I am trying to run the package and I need to import the txt file,
> but I need to import it so that each line is one observation and not
> a single block of text (I have about 63,000 lines). I cannot find the solution in the
> links. Would you know how to do it?
>
> Thanks!
>
> On Tue, 28 November 2017, 3:50, Freddy Omar López Quintero wrote:
> > On Tue, 28-11-2017 at 03:42 +0100, miriam.alz...@unavarra.es wrote:
> >> I have a vector of 40 words (marca) and I need to know whether one of
> >> the variables of the data.frame (datos) contains any of those 40
> >> words. If it contains any of them, I would like to create a dummy
> >> variable, with 1 meaning that it contains one of the words and 0
> >> meaning that it does not.
> >>
> >> Which package do you recommend? What would be the command to run?
> >
> > What you describe looks like text mining, and what you seem to want
> > is a portion of the matrix known as the Term-Document Matrix. The package
> > par excellence for these tasks is tm:
> >
> > https://cran.r-project.org/web/packages/tm/
> >
> > which has a good vignette:
> >
> > https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
> >
> > I hope it helps.
> >
> > Regards.
> >
> > --
> > «...homines autem hominum causa esse generatos...»
> >
> > Cicero

___ R-help-es mailing list R-help-es@r-project.org https://stat.ethz.ch/mailman/listinfo/r-help-es
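The readLines() answer, together with the dummy variable asked for in the original question, can be sketched in base R as follows. The three text lines and the two words in `marca` are made up, and a temporary file stands in for the real txt file; this is one possible approach under those assumptions, not the tm-based route suggested earlier in the thread.

```r
# Stand-in for the text file: one observation per line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("I bought a Nike shirt", "no brand here", "adidas shoes"), tmp)

texto <- readLines(tmp)  # character vector, one element per line
datos <- data.frame(texto = texto, stringsAsFactors = FALSE)

marca <- c("nike", "adidas")  # the real vector would hold 40 words

# For each word, test every line; then combine the results with a logical OR.
hits <- lapply(marca, grepl, x = tolower(datos$texto), fixed = TRUE)
datos$dummy <- as.integer(Reduce(`|`, hits))
```

Note that this matches substrings, so a word in `marca` would also be found inside a longer word; matching whole words only would need a regular expression with word boundaries instead of `fixed = TRUE`.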
Re: [R] How to extract coefficients from sequential (type 1), ANOVAs using lmer and lme
(This time with the r-help in the recipients...) Be careful when mixing lme4 and lmerTest together -- lmerTest extends and changes the behavior of various lme4 functions. From the help page for lme4-anova (?lme4::anova.merMod) > ‘anova’: returns the sequential decomposition of the contributions > of fixed-effects terms or, for multiple arguments, model > comparison statistics. For objects of class ‘lmerMod’ the > default behavior is to refit the models with ML if fitted > with ‘REML = TRUE’, this can be controlled via the ‘refit’ > argument. See also ‘anova’. So lme4-anova will give you sequential tests; note, however, that lme4 won't calculate the denominator degrees of freedom for you and thus won't give p-values. See the FAQ (https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-are-p_002dvalues-not-displayed-when-using-lmer_0028_0029_003f) From the help page for lmerTest-anova (?lmerTest::anova.merModLmerTest): > Usage: > > ## S4 method for signature 'merModLmerTest' > anova(object, ... , ddf="Satterthwaite", > type=3) > > Arguments: > ... > type: type of hypothesis to be tested. Could be type=3 or type=2 or > type = 1 (The definition comes from SAS theory) So lmerTest-anova by default gives you Type III ('marginal', although Type II is what actually gives you tests that respect the Principle of Marginality; see John Fox's Applied Regression Analysis (book) or Venables' "Exegeses on Linear Models" (https://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf) for more information on that. Type I tests are the sequential tests, so with anova(model, type=1), you will get the sequential tests you want. lmerTest will approximate the denominator degrees of freedom for you (using Satterthwaite method by default, or the more computationally intensive Kenward-Roger method), so you'll get p-values if that's what you want. Finally, it's important to note two things: 1. The "type"-argument for nlme::summary doesn't actually do anything (see ?nlme::summary.lme). 
It's just passed on to the 'print' method, where it's silently ignored. The 'type' of sum of squares is an ANOVA thing; the closest correspondence in terms of model coefficients is the coding of your categorical contrasts. See the literature mentioned above for more details as well as Dale Barr's discussion on simple vs. main effects in regression models (http://talklab.psy.gla.ac.uk/tvw/catpred/). (?nlme::anova.lme does indeed have a 'type' argument.)

2. It is possible for the sequential tests and the marginal tests to yield the same results. Again, see the above literature. You have no interactions in your model and continuous (i.e. non-categorical) predictors, so if they're orthogonal, then the sequential and marginal tests will be numerically the same, even if they test different hypotheses. (See section 5.2, starting on page 14; the sequential tests are the "eliminating" tests, while the marginal tests are the "ignoring" tests in that explanation.)

Best,
Phillip

On 28/11/17 12:00, r-help-requ...@r-project.org wrote:
> I want to run sequential ANOVAs (i.e. type I sums of squares), and am trying to
> get results including ANOVA tables and associated coefficients for predictive
> variables (I am using R version 3.4.2). I think the ANOVA tables look right,
> but believe the coefficients are wrong. Specifically, it looks like the
> coefficients are from an ANOVA with "marginal" (type III) sums of squares. I have
> tried both lme (nlme package) and lmer (lme4 + lmerTest packages). Examples of
> the results are below:
>
> I believe the results from summary() are for "marginal" instead of
> "sequential" ANOVA because the p-values (i.e., 0.237 for narea) in summary are
> identical to those in the tables from "marginal". I also used lmer in the lme4
> package and found the same results (summary() results look like they are from "marginal").
>
> Can anybody tell me how to get coefficients for "sequential" ANOVAs? Thank you.
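
To illustrate point 2 of the reply above, here is a minimal sketch with simulated data. Everything in it is an assumption made for illustration: the variable names (x1, x2, g, y), the effect sizes, and the balanced design are all made up and are not the original poster's data. An ordinary lm() fit stands in for the mixed model so that the comparison runs in base R; the lmerTest calls at the end are only run if that package happens to be installed.

```r
# Simulated, balanced design: x1 and x2 are orthogonal by construction.
# All names (x1, x2, g, y) are made up for this illustration.
d <- expand.grid(x1 = c(-1, 0, 1),
                 x2 = c(-1.5, -0.5, 0.5, 1.5),
                 g  = factor(1:6))
set.seed(1)
b <- rnorm(nlevels(d$g))                      # one random intercept per group
d$y <- 2 + 0.5 * d$x1 - 0.3 * d$x2 +
  b[as.integer(d$g)] + rnorm(nrow(d), sd = 0.5)

# Fixed-effects-only fit, just to compare the two kinds of tests in base R
fit <- lm(y ~ x1 + x2, data = d)

seq_tab  <- anova(fit)              # sequential (Type I): x1, then x2 after x1
marg_tab <- drop1(fit, test = "F")  # marginal: each term after all the others

seq_ss  <- seq_tab[c("x1", "x2"), "Sum Sq"]
marg_ss <- marg_tab[c("x1", "x2"), "Sum of Sq"]
print(cbind(sequential = seq_ss, marginal = marg_ss))

# The mixed-model version, only if lmerTest is installed
if (requireNamespace("lmerTest", quietly = TRUE)) {
  m <- lmerTest::lmer(y ~ x1 + x2 + (1 | g), data = d)
  print(anova(m, type = 1))   # sequential tests
  print(anova(m, type = 3))   # lmerTest's default
}
```

With orthogonal, centered predictors the two columns printed for the lm() fit agree; with correlated predictors they generally would not, which is where the order of terms in the formula starts to matter.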