Re: [R] Removing a data subset

2017-11-29 Thread Ivan Calandra

Hi David,

You "just" need to learn how to subset your data.frame, see functions 
like ?subset and ?"[", as well as a good guide to understand the subtleties!


Some graphic functions also have a built-in argument to subset within 
the function (e.g. argument 'subset' in 'plot.formula'), although the 
ggplot() function doesn't seem to have it.


In any case, I would recommend you spend some time learning that aspect, 
as you will always need it in one situation or another.
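
For example, a minimal sketch (assuming your data.frame is called MyData and 
has a 'Location' column, as in your post):

## two equivalent ways to drop the MW01 rows
MyDataSub <- subset(MyData, Location != "MW01")
MyDataSub <- MyData[MyData$Location != "MW01", ]

## and the 'subset' argument of plot.formula mentioned above
plot(GW_Elv ~ Year, data = MyData, subset = Location != "MW01")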


HTH,
Ivan

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

On 29/11/2017 22:07, David Doyle wrote:

Say I have a dataset that looks like

Location  Year  GW_Elv
MW01      1999  546.63
MW02      1999  474.21
MW03      1999  471.94
MW04      1999  466.80
MW01      2000  545.90
MW02      2000  546.10

The whole dataset is at http://doylesdartden.com/ExampleData.csv
and I use the code below to do the graph, but I want to do it without MW01.
How can I remove MW01?

I'm sure I can do it by subsetting but I cannot figure out how to do it.

Thank you
David

--

library(ggplot2)

MyData <- read.csv("http://doylesdartden.com/ExampleData.csv;, header=TRUE,
sep=",")



#Sets which are detections and nondetects
MyData$Detections <- ifelse(MyData$D_GW_Elv ==1, "Detected", "NonDetect")

#Removes the NAs
MyDataWONA <- MyData[!is.na(MyData$Detections), ]

#does the plot
p <- ggplot(data = MyDataWONA, aes(x=Year, y=GW_Elv , col=Detections)) +
   geom_point(aes(shape=Detections)) +

   ##sets the colors
   scale_colour_manual(values=c("black","red")) + #scale_y_log10() +

   #location of the legend
   theme(legend.position=c("right")) +

   #sets the line color, type and size
   geom_line(colour="black", linetype="dotted", size=0.5) +
   ylab("Elevation Feet Mean Sea Level")

## does the graph using the Location IDs as the different Locations.
p + facet_grid(Location ~ .)

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] 2^3 confounded factorial experiment

2017-11-29 Thread David Winsemius

> On Nov 29, 2017, at 9:20 AM, Jyoti Bhogal  wrote:
> 
> The following R commands were written:
>> help.search("factorial")
>> data(npk)
>> npk
>> coef(npk.aov)
> 
> In the output of the coef command, please explain to me the interpretation of 
> the coefficients of block1 to block6 in this 2^3 confounded factorial experiment.

This is very much a statistics question and as such is off-topic (as it also 
would be off-topic on StackOverflow). R-help is for persons having difficulty 
coding in the R language itself.

Consider CrossValidated.com, but read their posting help section first, since 
this is really a very terse question. Better would be to include the output and 
offer your best interpretation, so people get the sense you at least put in 
some individual effort.
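
For reference, a minimal sketch of how an npk.aov object is usually created 
before coef() can be called on it, following the example in ?npk (npk.aov is 
not predefined in R):

data(npk)
npk.aov <- aov(yield ~ block + N*P*K, npk)
coef(npk.aov)  # block coefficients: differences from the first block
               # under the default treatment contrasts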

> 
> Thanks.
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   
-Gehm's Corollary to Clarke's Third Law

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Jim Lemon
Hi again,
Typo in the last email. Should read "about 40 standard deviations".

Jim

On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon  wrote:
> Hi Robert,
> People want different levels of automation in the software they use.
> What concerns many of us is the desire for the function
> "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> Such users typically want something that justifies its use by being
> written by someone who seems to know what they're doing and lots of
> other people use it. One advantage of many R functions is their
> modular construction. This encourages users to at least consider the
> steps that are taken rather than just accept what comes out of that
> long tube.
>
> Take the contentious problem of outlier identification. If I just let
> the black box peel off some values, I don't know what I have lost. On
> the other hand, if I import data and examine it with a summary
> function, I may find that one woman has a height of 5.2 meters. I can
> range check by looking up the Guinness Book of Records. It's an
> outlier. I can estimate the probability of such a height.  Hmm, about
> 4 standard deviations above the mean. It's an outlier. I can attempt a
> Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> has been recorded as a metric value". It's not an outlier.
>
> The more R gravitates toward "black box" functions, the more some
> users are encouraged to let them do the work. You pays your money and
> you takes your chances.
>
> Jim
>
>
> On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins  wrote:
>> R has a very wide audience, clinical research, astronomy, psychology, and
>> so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This regards the process of getting the data ready for analysis and
>> reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW,  I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Jim Lemon
Hi Robert,
People want different levels of automation in the software they use.
What concerns many of us is the desire for the function
"figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
Such users typically want something that justifies its use by being
written by someone who seems to know what they're doing and lots of
other people use it. One advantage of many R functions is their
modular construction. This encourages users to at least consider the
steps that are taken rather than just accept what comes out of that
long tube.

Take the contentious problem of outlier identification. If I just let
the black box peel off some values, I don't know what I have lost. On
the other hand, if I import data and examine it with a summary
function, I may find that one woman has a height of 5.2 meters. I can
range check by looking up the Guinness Book of Records. It's an
outlier. I can estimate the probability of such a height.  Hmm, about
4 standard deviations above the mean. It's an outlier. I can attempt a
Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
has been recorded as a metric value". It's not an outlier.

The more R gravitates toward "black box" functions, the more some
users are encouraged to let them do the work. You pays your money and
you takes your chances.
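
By way of illustration, a small sketch of those checks (the mean and sd here 
are assumed round figures for adult female height, not from real data):

h <- 5.2                  # suspicious recorded height, in metres
(h - 1.62) / 0.09         # roughly 40 standard deviations above the mean
(5 * 12 + 2) * 0.0254     # 5'2" in metres: about 1.57, a plausible height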

Jim


On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins  wrote:
> R has a very wide audience, clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
>
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
>
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
>
> [BTW,  I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
>
> Robert Wilkins
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] dplyr - add/expand rows

2017-11-29 Thread Martin Morgan

On 11/29/2017 05:47 PM, Tóth Dénes wrote:

Hi Martin,

On 11/29/2017 10:46 PM, Martin Morgan wrote:

On 11/29/2017 04:15 PM, Tóth Dénes wrote:

Hi,

A benchmarking study with an additional (data.table-based) solution. 


I don't think speed is the right benchmark (I do agree that 
correctness is!).


Well, agreed, and sorry for the wording. It was really just an exercise 
and not a full evaluation of the approaches. When I read the avalanche 
of solutions, none of which mentioned data.table (my first choice for 
data.frame manipulation), I became curious how a one-liner of data.table 
code performs against the other solutions in terms of speed and 
readability.
Second, I quite often have the feeling that dplyr is extremely overused 
among novice (and sometimes even experienced) R users nowadays. This is 
unfortunate, as the present example also illustrates.


Another solution is Bill's approach and dplyr's implementation (adding 
the 1L to keep integers integers!)


fun_bill1 <- function(d) {
  i <- rep(seq_len(nrow(d)), d$to - d$from + 1L)
  j <- sequence(d$to - d$from + 1L)
  ## d[i,] %>% mutate(year = from + j - 1L, from = NULL, to = NULL)
  mutate(d[i,], year = from + j - 1L, from = NULL, to = NULL)
}

which is competitive with IRanges and data.table (the more dplyr-ish? 
solution


  d[i, ] %>% mutate(year = from + j - 1L) %>%
  select(station, record, year))

has intermediate performance) and might appeal to those introduced to R 
through dplyr but wanting more base R knowledge, and vice versa. I think 
if dplyr introduces new users to R, or exposes R users to new approaches 
for working with data, that's great!
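
As a quick sanity check of the indexing (a sketch on a one-row input, not from 
the thread; assumes dplyr is loaded):

d <- data.frame(station = "A", from = 1999L, to = 2001L, record = "Q",
                stringsAsFactors = FALSE)
fun_bill1(d)   # three rows, with year = 1999, 2000, 2001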


Martin




Regards,
Denes



For the R-help list, maybe something about least specialized R 
knowledge required would be appropriate? I'd say there were some 
'hard' solutions -- Michael (deep understanding of Bioconductor and 
IRanges), Toth (deep understanding of data.table), Jim (at least for 
me moderate understanding of dplyr, especially the .$ notation; a 
simpler dplyr answer might have moved this response out of the 
'difficult' category, especially given the familiarity of the OP with 
dplyr). I'd vote for Bill's as requiring the least specialized 
knowledge of R (though the +/- 1 indexing is an easy thing to get wrong).


A different criterion might be reuse across analysis scenarios. Bill 
seems to win here again, since the principles are very general and at 
least moderately efficient (both Bert and Martin's solutions are 
essentially R-level iterations and have poor scalability, as 
demonstrated in the microbenchmarks; Bill's is mostly vectorized). 
Certainly data.table, dplyr, and IRanges are extremely useful within 
the confines of the problem domains they address.


Martin


Enjoy! ;)

Cheers,
Denes


--


## packages ##

library(dplyr)
library(data.table)
library(IRanges)
library(microbenchmark)

## prepare example dataset ###

## use Bert's example, with 2000 stations instead of 2
d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L),
 from = as.integer(c(60,61,71,72,76,60,65,82,83)),
 to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
 record = c("A","B","C","B","D","B","B","D","E"),
 stringsAsFactors = FALSE)
stations <- rle(d_df$station)
stations$value <- gsub(
   " ", "0",
   paste0("station", format(1:length(stations$value), width = 6)))
d_df$station <- rep(stations$value, stations$lengths)

## prepare tibble and data.table versions
d_tbl <- as_tibble(d_df)
d_dt <- as.data.table(d_df)

## solutions ##

## Bert - by
fun_bert <- function(d) {
   out <- by(
 d, d$station, function(x) with(x, {
   i <- to - from +1
   data.frame(record =rep(record,i),
  year =sequence(i) -1 + rep(from,i),
  stringsAsFactors = FALSE)
 }))
   data.frame(station = rep(names(out), sapply(out,nrow)),
  do.call(rbind,out),
  row.names = NULL,
  stringsAsFactors = FALSE)
}

## Bill - transform
fun_bill <- function(d) {
   i <- rep(seq_len(nrow(d)), d$to-d$from+1)
   j <- sequence(d$to-d$from+1)
   transform(d[i,], year=from+j-1, from=NULL, to=NULL)
}

## Michael - IRanges
fun_michael <- function(d) {
   df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
   expand(df, "year")
}

## Jim - dplyr
fun_jim <- function(d) {
   d %>%
 rowwise() %>%
 do(tibble(station = .$station,
   record = .$record,
   year = seq(.$from, .$to))
 )
}

## Martin - Map
fun_martin <- function(d) {
   d$year <- with(d, Map(seq, from, to))
   res0 <- with(d, Map(data.frame,
   station=station,
   record=record,
   year=year,
   MoreArgs = list(stringsAsFactors = FALSE)))
   do.call(rbind, unname(res0))
}

## Denes - simple data.table
fun_denes <- function(d) {
   out <- d[, .(year = from:to), by = .(station, from, record)]
   out[, from := NULL]
}

Re: [R] dplyr - add/expand rows

2017-11-29 Thread Tóth Dénes

Hi Martin,

On 11/29/2017 10:46 PM, Martin Morgan wrote:

On 11/29/2017 04:15 PM, Tóth Dénes wrote:

Hi,

A benchmarking study with an additional (data.table-based) solution. 


I don't think speed is the right benchmark (I do agree that correctness 
is!).


Well, agreed, and sorry for the wording. It was really just an exercise 
and not a full evaluation of the approaches. When I read the avalanche 
of solutions, none of which mentioned data.table (my first choice for 
data.frame manipulation), I became curious how a one-liner of data.table 
code performs against the other solutions in terms of speed and 
readability.
Second, I quite often have the feeling that dplyr is extremely overused 
among novice (and sometimes even experienced) R users nowadays. This is 
unfortunate, as the present example also illustrates.


Regards,
Denes



For the R-help list, maybe something about least specialized R knowledge 
required would be appropriate? I'd say there were some 'hard' solutions 
-- Michael (deep understanding of Bioconductor and IRanges), Toth (deep 
understanding of data.table), Jim (at least for me moderate 
understanding of dplyr, especially the .$ notation; a simpler dplyr 
answer might have moved this response out of the 'difficult' category, 
especially given the familiarity of the OP with dplyr). I'd vote for 
Bill's as requiring the least specialized knowledge of R (though the +/- 
1 indexing is an easy thing to get wrong).


A different criterion might be reuse across analysis scenarios. Bill 
seems to win here again, since the principles are very general and at 
least moderately efficient (both Bert and Martin's solutions are 
essentially R-level iterations and have poor scalability, as 
demonstrated in the microbenchmarks; Bill's is mostly vectorized). 
Certainly data.table, dplyr, and IRanges are extremely useful within the 
confines of the problem domains they address.


Martin


Enjoy! ;)

Cheers,
Denes


--


## packages ##

library(dplyr)
library(data.table)
library(IRanges)
library(microbenchmark)

## prepare example dataset ###

## use Bert's example, with 2000 stations instead of 2
d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L),
 from = as.integer(c(60,61,71,72,76,60,65,82,83)),
 to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
 record = c("A","B","C","B","D","B","B","D","E"),
 stringsAsFactors = FALSE)
stations <- rle(d_df$station)
stations$value <- gsub(
   " ", "0",
   paste0("station", format(1:length(stations$value), width = 6)))
d_df$station <- rep(stations$value, stations$lengths)

## prepare tibble and data.table versions
d_tbl <- as_tibble(d_df)
d_dt <- as.data.table(d_df)

## solutions ##

## Bert - by
fun_bert <- function(d) {
   out <- by(
 d, d$station, function(x) with(x, {
   i <- to - from +1
   data.frame(record =rep(record,i),
  year =sequence(i) -1 + rep(from,i),
  stringsAsFactors = FALSE)
 }))
   data.frame(station = rep(names(out), sapply(out,nrow)),
  do.call(rbind,out),
  row.names = NULL,
  stringsAsFactors = FALSE)
}

## Bill - transform
fun_bill <- function(d) {
   i <- rep(seq_len(nrow(d)), d$to-d$from+1)
   j <- sequence(d$to-d$from+1)
   transform(d[i,], year=from+j-1, from=NULL, to=NULL)
}

## Michael - IRanges
fun_michael <- function(d) {
   df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
   expand(df, "year")
}

## Jim - dplyr
fun_jim <- function(d) {
   d %>%
 rowwise() %>%
 do(tibble(station = .$station,
   record = .$record,
   year = seq(.$from, .$to))
 )
}

## Martin - Map
fun_martin <- function(d) {
   d$year <- with(d, Map(seq, from, to))
   res0 <- with(d, Map(data.frame,
   station=station,
   record=record,
   year=year,
   MoreArgs = list(stringsAsFactors = FALSE)))
   do.call(rbind, unname(res0))
}

## Denes - simple data.table
fun_denes <- function(d) {
   out <- d[, .(year = from:to), by = .(station, from, record)]
   out[, from := NULL]
}

## Check equality 
all.equal(fun_bill(d_df), fun_bert(d_df),
   check.attributes = FALSE)
all.equal(fun_bill(d_df), fun_martin(d_df),
   check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
   check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
   check.attributes = FALSE)
# Be prepared: this solution is super slow
all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
   check.attributes = FALSE)

## Benchmark #

## Martin
print(system.time(fun_martin(d_df)))

## Bert
print(system.time(fun_bert(d_df)))

## Top 3
print(
   microbenchmark(
      fun_bill(d_df),
      fun_michael(d_df),
      fun_denes(d_dt),
      times = 100L
   )
)

Re: [R] Removing a data subset

2017-11-29 Thread Rainer Schuermann
Reading in the data from the file

x <- read.csv( "ExampleData.csv", header = TRUE, stringsAsFactors = FALSE )

Subsetting  as you want

x <- x[ x$Location != "MW01", ]

This selects all rows where the value in column 'Location' is not equal to 
"MW01". The comma after that ensures that all columns are copied into the 
amended data.frame.
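
An equivalent call using subset(), which evaluates 'Location' inside the 
data.frame for you:

x <- subset(x, Location != "MW01")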

Rgds,
Rainer

On Wednesday, 29 November 2017 15:07:34 +08 David Doyle wrote:
> Say I have a dataset that looks like
> 
> Location  Year  GW_Elv
> MW01      1999  546.63
> MW02      1999  474.21
> MW03      1999  471.94
> MW04      1999  466.80
> MW01      2000  545.90
> MW02      2000  546.10
> 
> The whole dataset is at http://doylesdartden.com/ExampleData.csv
> and I use the code below to do the graph, but I want to do it without MW01.
> How can I remove MW01?
> 
> I'm sure I can do it by subsetting but I cannot figure out how to do it.
> 
> Thank you
> David
> 
> --
> 
> library(ggplot2)
> 
> MyData <- read.csv("http://doylesdartden.com/ExampleData.csv;, header=TRUE,
> sep=",")
> 
> 
> 
> #Sets which are detections and nondetects
> MyData$Detections <- ifelse(MyData$D_GW_Elv ==1, "Detected", "NonDetect")
> 
> #Removes the NAs
> MyDataWONA <- MyData[!is.na(MyData$Detections), ]
> 
> #does the plot
> p <- ggplot(data = MyDataWONA, aes(x=Year, y=GW_Elv , col=Detections)) +
>   geom_point(aes(shape=Detections)) +
> 
>   ##sets the colors
>   scale_colour_manual(values=c("black","red")) + #scale_y_log10() +
> 
>   #location of the legend
>   theme(legend.position=c("right")) +
> 
>   #sets the line color, type and size
>   geom_line(colour="black", linetype="dotted", size=0.5) +
>   ylab("Elevation Feet Mean Sea Level")
> 
> ## does the graph using the Location IDs as the different Locations.
> p + facet_grid(Location ~ .)
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] dplyr - add/expand rows

2017-11-29 Thread Martin Morgan

On 11/29/2017 04:15 PM, Tóth Dénes wrote:

Hi,

A benchmarking study with an additional (data.table-based) solution. 


I don't think speed is the right benchmark (I do agree that correctness 
is!).


For the R-help list, maybe something about least specialized R knowledge 
required would be appropriate? I'd say there were some 'hard' solutions 
-- Michael (deep understanding of Bioconductor and IRanges), Toth (deep 
understanding of data.table), Jim (at least for me moderate 
understanding of dplyr, especially the .$ notation; a simpler dplyr 
answer might have moved this response out of the 'difficult' category, 
especially given the familiarity of the OP with dplyr). I'd vote for 
Bill's as requiring the least specialized knowledge of R (though the +/- 
1 indexing is an easy thing to get wrong).


A different criterion might be reuse across analysis scenarios. Bill 
seems to win here again, since the principles are very general and at 
least moderately efficient (both Bert and Martin's solutions are 
essentially R-level iterations and have poor scalability, as 
demonstrated in the microbenchmarks; Bill's is mostly vectorized). 
Certainly data.table, dplyr, and IRanges are extremely useful within the 
confines of the problem domains they address.


Martin


Enjoy! ;)

Cheers,
Denes


--


## packages ##

library(dplyr)
library(data.table)
library(IRanges)
library(microbenchmark)

## prepare example dataset ###

## use Bert's example, with 2000 stations instead of 2
d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L),
     from = as.integer(c(60,61,71,72,76,60,65,82,83)),
     to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
     record = c("A","B","C","B","D","B","B","D","E"),
     stringsAsFactors = FALSE)
stations <- rle(d_df$station)
stations$value <- gsub(
   " ", "0",
   paste0("station", format(1:length(stations$value), width = 6)))
d_df$station <- rep(stations$value, stations$lengths)

## prepare tibble and data.table versions
d_tbl <- as_tibble(d_df)
d_dt <- as.data.table(d_df)

## solutions ##

## Bert - by
fun_bert <- function(d) {
   out <- by(
     d, d$station, function(x) with(x, {
   i <- to - from +1
   data.frame(record =rep(record,i),
  year =sequence(i) -1 + rep(from,i),
  stringsAsFactors = FALSE)
     }))
   data.frame(station = rep(names(out), sapply(out,nrow)),
  do.call(rbind,out),
  row.names = NULL,
  stringsAsFactors = FALSE)
}

## Bill - transform
fun_bill <- function(d) {
   i <- rep(seq_len(nrow(d)), d$to-d$from+1)
   j <- sequence(d$to-d$from+1)
   transform(d[i,], year=from+j-1, from=NULL, to=NULL)
}

## Michael - IRanges
fun_michael <- function(d) {
   df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
   expand(df, "year")
}

## Jim - dplyr
fun_jim <- function(d) {
   d %>%
     rowwise() %>%
     do(tibble(station = .$station,
   record = .$record,
   year = seq(.$from, .$to))
     )
}

## Martin - Map
fun_martin <- function(d) {
   d$year <- with(d, Map(seq, from, to))
   res0 <- with(d, Map(data.frame,
   station=station,
   record=record,
   year=year,
   MoreArgs = list(stringsAsFactors = FALSE)))
   do.call(rbind, unname(res0))
}

## Denes - simple data.table
fun_denes <- function(d) {
   out <- d[, .(year = from:to), by = .(station, from, record)]
   out[, from := NULL]
}

## Check equality 
all.equal(fun_bill(d_df), fun_bert(d_df),
   check.attributes = FALSE)
all.equal(fun_bill(d_df), fun_martin(d_df),
   check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
   check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
   check.attributes = FALSE)
# Be prepared: this solution is super slow
all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
   check.attributes = FALSE)

## Benchmark #

## Martin
print(system.time(fun_martin(d_df)))

## Bert
print(system.time(fun_bert(d_df)))

## Top 3
print(
   microbenchmark(
     fun_bill(d_df),
     fun_michael(d_df),
     fun_denes(d_dt),
     times = 100L
   )
)


-

On 11/28/2017 06:49 PM, Michael Lawrence wrote:

Or with the Bioconductor IRanges package:

df <- with(input, DataFrame(station, year=IRanges(from, to), record))
expand(df, "year")

DataFrame with 24 rows and 3 columns
 station year  record
   
1   07EA001  1960 QMS
2   07EA001  1961 QMC
3   07EA001  1962 QMC
4   07EA001  1963 QMC
5   07EA001  1964 QMC
... ...   ... ...
20  07EA001  1979 QRC
21  07EA001  1980 QRC
22  07EA001  1981 QRC
23  07EA001  1982 QRC
24  07EA001  1983 QRC

Re: [R] dplyr - add/expand rows

2017-11-29 Thread Tóth Dénes

Hi,

A benchmarking study with an additional (data.table-based) solution. 
Enjoy! ;)


Cheers,
Denes


--


## packages ##

library(dplyr)
library(data.table)
library(IRanges)
library(microbenchmark)

## prepare example dataset ###

## use Bert's example, with 2000 stations instead of 2
d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L),
from = as.integer(c(60,61,71,72,76,60,65,82,83)),
to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
record = c("A","B","C","B","D","B","B","D","E"),
stringsAsFactors = FALSE)
stations <- rle(d_df$station)
stations$value <- gsub(
  " ", "0",
  paste0("station", format(1:length(stations$value), width = 6)))
d_df$station <- rep(stations$value, stations$lengths)

## prepare tibble and data.table versions
d_tbl <- as_tibble(d_df)
d_dt <- as.data.table(d_df)

## solutions ##

## Bert - by
fun_bert <- function(d) {
  out <- by(
d, d$station, function(x) with(x, {
  i <- to - from +1
  data.frame(record =rep(record,i),
 year =sequence(i) -1 + rep(from,i),
 stringsAsFactors = FALSE)
}))
  data.frame(station = rep(names(out), sapply(out,nrow)),
 do.call(rbind,out),
 row.names = NULL,
 stringsAsFactors = FALSE)
}

## Bill - transform
fun_bill <- function(d) {
  i <- rep(seq_len(nrow(d)), d$to-d$from+1)
  j <- sequence(d$to-d$from+1)
  transform(d[i,], year=from+j-1, from=NULL, to=NULL)
}

## Michael - IRanges
fun_michael <- function(d) {
  df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
  expand(df, "year")
}

## Jim - dplyr
fun_jim <- function(d) {
  d %>%
rowwise() %>%
do(tibble(station = .$station,
  record = .$record,
  year = seq(.$from, .$to))
)
}

## Martin - Map
fun_martin <- function(d) {
  d$year <- with(d, Map(seq, from, to))
  res0 <- with(d, Map(data.frame,
  station=station,
  record=record,
  year=year,
  MoreArgs = list(stringsAsFactors = FALSE)))
  do.call(rbind, unname(res0))
}

## Denes - simple data.table
fun_denes <- function(d) {
  out <- d[, .(year = from:to), by = .(station, from, record)]
  out[, from := NULL]
}

## Check equality 
all.equal(fun_bill(d_df), fun_bert(d_df),
  check.attributes = FALSE)
all.equal(fun_bill(d_df), fun_martin(d_df),
  check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
  check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
  check.attributes = FALSE)
# Be prepared: this solution is super slow
all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
  check.attributes = FALSE)

## Benchmark #

## Martin
print(system.time(fun_martin(d_df)))

## Bert
print(system.time(fun_bert(d_df)))

## Top 3
print(
  microbenchmark(
fun_bill(d_df),
fun_michael(d_df),
fun_denes(d_dt),
times = 100L
  )
)


-

On 11/28/2017 06:49 PM, Michael Lawrence wrote:

Or with the Bioconductor IRanges package:

df <- with(input, DataFrame(station, year=IRanges(from, to), record))
expand(df, "year")

DataFrame with 24 rows and 3 columns
 station year  record
   
1   07EA001  1960 QMS
2   07EA001  1961 QMC
3   07EA001  1962 QMC
4   07EA001  1963 QMC
5   07EA001  1964 QMC
... ...   ... ...
20  07EA001  1979 QRC
21  07EA001  1980 QRC
22  07EA001  1981 QRC
23  07EA001  1982 QRC
24  07EA001  1983 QRC

If you tell the computer more about your data, it can do more things for
you.

Michael

On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan <
martin.mor...@roswellpark.org> wrote:


On 11/26/2017 08:42 PM, jim holtman wrote:


try this:

##

library(dplyr)

input <- tribble(
~station, ~from, ~to, ~record,
   "07EA001" ,1960  ,  1960  , "QMS",
   "07EA001"  ,   1961 ,   1970  , "QMC",
   "07EA001" ,1971  ,  1971  , "QMM",
   "07EA001" ,1972  ,  1976  , "QMC",
   "07EA001" ,1977  ,  1983  , "QRC"
)

result <- input %>%
rowwise() %>%
do(tibble(station = .$station,
  year = seq(.$from, .$to),
  record = .$record)
)

###



In a bit more 'base R' mode I did

   input$year <- with(input, Map(seq, from, to))
   res0 <- with(input, Map(data.frame, station=station, year=year,
   record=record))
   as_tibble(do.call(rbind, unname(res0)))

resulting in

# A tibble: 24 x 3
   station  year record

[R] Removing a data subset

2017-11-29 Thread David Doyle
Say I have a dataset that looks like

Location  Year  GW_Elv
MW01      1999  546.63
MW02      1999  474.21
MW03      1999  471.94
MW04      1999  466.80
MW01      2000  545.90
MW02      2000  546.10

The whole dataset is at http://doylesdartden.com/ExampleData.csv
and I use the code below to do the graph, but I want to do it without MW01.
How can I remove MW01?

I'm sure I can do it by subsetting but I cannot figure out how to do it.

Thank you
David

--

library(ggplot2)

MyData <- read.csv("http://doylesdartden.com/ExampleData.csv;, header=TRUE,
sep=",")



#Sets which are detections and nondetects
MyData$Detections <- ifelse(MyData$D_GW_Elv ==1, "Detected", "NonDetect")

#Removes the NAs
MyDataWONA <- MyData[!is.na(MyData$Detections), ]

#does the plot
p <- ggplot(data = MyDataWONA, aes(x=Year, y=GW_Elv , col=Detections)) +
  geom_point(aes(shape=Detections)) +

  ##sets the colors
  scale_colour_manual(values=c("black","red")) + #scale_y_log10() +

  #location of the legend
  theme(legend.position=c("right")) +

  #sets the line color, type and size
  geom_line(colour="black", linetype="dotted", size=0.5) +
  ylab("Elevation Feet Mean Sea Level")

## does the graph using the Location IDs as the different Locations.
p + facet_grid(Location ~ .)

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] 2^3 confounded factorial experiment

2017-11-29 Thread Jyoti Bhogal
The following R commands were written:
>help.search("factorial")
>data(npk)
>npk
>coef(npk.aov)

In the output of the coef command, please explain to me the interpretation of 
the coefficients of block1 to block6 in this 2^3 confounded factorial experiment.

Thanks.
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SAMseq errors

2017-11-29 Thread Jeff Newmiller
A) This list is a general interest list on the R language... you have posed 
your question as if you were looking for domain experts such as you might be 
more likely to find on the Bioconductor mailing list. 

B) Example is not reproducible. [1][2][3]

C) Just because your data don't have missing values does not mean that your 
early analysis steps don't create them, e.g. by taking the logarithm of 
negative numbers. Look at intermediate values in your analysis, and read the 
documentation for steps you are treating as "magic black boxes".
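
For instance, a minimal sketch (not SAMseq itself) of how an upstream NaN 
triggers exactly the quantile() error you saw:

prop <- log(c(0.5, -0.2, 2))    # log of a negative value yields NaN, with a warning
quantile(prop, c(0.25, 0.75))   # Error: missing values and NaN's not allowed
                                # if 'na.rm' is FALSE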

[1] 
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

[2] http://adv-r.had.co.nz/Reproducibility.html

[3] https://cran.r-project.org/web/packages/reprex/index.html (read the 
vignette)
-- 
Sent from my phone. Please excuse my brevity.

On November 29, 2017 9:39:24 AM PST, array chip via R-help 
 wrote:
>Sorry, I forgot to use plain text format; hope this time it works:
>
>Hi, I am trying to use SAMseq() to analyze my RNA-seq experiment
>(2 genes x 550 samples) with survival endpoint. It quickly gives the
>following error:
>
>> library(samr)
>Loading required package: impute
>Loading required package: matrixStats
>
>Attaching package: ‘matrixStats’
>
>The following objects are masked from ‘package:Biobase’:
>
>    anyMissing, rowMedians
>
>Warning messages:
>1: package ‘samr’ was built under R version 3.3.3 
>2: package ‘matrixStats’ was built under R version 3.3.3
>
>> samfit<-SAMseq(data, PFI.time,censoring.status=PFI.status,
>resp.type="Survival")
>
>Estimating sequencing depths...
>Error in quantile.default(prop, c(0.25, 0.75)) : 
>  missing values and NaN's not allowed if 'na.rm' is FALSE
>In addition: Warning message:
>In sum(x) : integer overflow - use sum(as.numeric(.))
>Error during wrapup: cannot open the connection
>
>> sessionInfo()
>R version 3.3.2 (2016-10-31)
>Platform: x86_64-w64-mingw32/x64 (64-bit)
>Running under: Windows 7 x64 (build 7601) Service Pack 1
>
>locale:
>[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>States.1252    LC_MONETARY=English_United States.1252
>[4] LC_NUMERIC=C                           LC_TIME=English_United
>States.1252    
>
>attached base packages:
>[1] stats     graphics  grDevices datasets  utils     methods   base   
> 
>
>other attached packages:
>[1] samr_2.0             matrixStats_0.52.2   impute_1.48.0       
>BiocInstaller_1.24.0 rcom_3.1-3           rscproxy_2.1-1      
>
>loaded via a namespace (and not attached):
>[1] tools_3.3.2
>
>
>I checked, my data matrix and y variables have no missing values.
>Anyone has suggestions what's going on?
>
>Thank you!
>
>John
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] SAMseq errors

2017-11-29 Thread array chip via R-help
Sorry, I forgot to use plain text format; hope this time it works:

Hi, I am trying to use SAMseq() to analyze my RNA-seq experiment (2 genes 
x 550 samples) with survival endpoint. It quickly gives the following error:

> library(samr)
Loading required package: impute
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

Warning messages:
1: package ‘samr’ was built under R version 3.3.3 
2: package ‘matrixStats’ was built under R version 3.3.3

> samfit<-SAMseq(data, PFI.time,censoring.status=PFI.status, 
> resp.type="Survival")

Estimating sequencing depths...
Error in quantile.default(prop, c(0.25, 0.75)) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE
In addition: Warning message:
In sum(x) : integer overflow - use sum(as.numeric(.))
Error during wrapup: cannot open the connection

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  
  LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252   
 

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] samr_2.0             matrixStats_0.52.2   impute_1.48.0        
BiocInstaller_1.24.0 rcom_3.1-3           rscproxy_2.1-1      

loaded via a namespace (and not attached):
[1] tools_3.3.2


I checked, my data matrix and y variables have no missing values. Anyone has 
suggestions what's going on?

Thank you!

John

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Robert Wilkins
Christopher,

OK, well what about a range of functions in an R package that
automatically, with very little syntax, pull in data from a variety of
formats (CSV, SQLite, and so on) and convert them to R data frames? You
seem to be pointing to something like that.
Something like that, in some form or another, probably already exists,
though it might be either imperfect (not as user-friendly as possible) or
not well publicised, or both.
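
Much of it already exists in base R plus the DBI/RSQLite packages; a quick 
sketch (the file and table names here are hypothetical):

csv_df <- read.csv("mydata.csv", stringsAsFactors = FALSE)

library(DBI)  # assumes the RSQLite package is installed
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")
sqlite_df <- dbReadTable(con, "mytable")
dbDisconnect(con)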
Or another tangent: your co-workers are not going to stop using Excel,
whether you like it or not, and many end-users are stuck in the exact same
position as you (co-workers who deliver the data in Excel). I will guess
that data stored in Excel tends to be dirty in somewhat predictable ways.
(And again, those other end-users' coworkers are not going to change their
behaviour.) And so: a data munging tool that makes it as easy as possible
to clean up the data in Excel spreadsheets and export it to R data
frames. One prerequisite: an understanding of what tends to go wrong with
data in Excel (the data in Excel tends to be dirty, but dirty in what
way?).

Thank you for your response, Christopher. What state are you in?


On Wed, Nov 29, 2017 at 11:52 AM, Christopher W. Ryan 
wrote:

> Great question. What do I want? I want my co-workers to stop using Excel
> spreadsheets for data entry, storage, and sharing! I want them to
> understand the value of data discipline. But alas . . . .
>
> I work in a county health department in the US. Between dplyr, stringr,
> grep, grepl, and the base R read() functions, I'm doing OK.
>
> I need to learn more about APIs, so I can see if I can make R directly
> grab data from, e.g. our state health department sources. My biggest
> hassle is having to download a data file, save it somewhere, and then
> open R and read it in. I'd like to be able to do it all in R. Would make
> the generation of recurring reports easier.
>
> --Chris Ryan
>
> Robert Wilkins wrote:
> > R has a very wide audience, clinical research, astronomy, psychology, and
> > so on and so on.
> > I would consider data analysis work to be three stages: data preparation,
> > statistical analysis, and producing the report.
> > This regards the process of getting the data ready for analysis and
> > reporting, sometimes called "data cleaning" or "data munging" or "data
> > wrangling".
> >
> > So as regards tools for data preparation, speaking to the highly diverse
> > audience mentioned, here is my question:
> >
> > What do you want?
> > Or are you already quite happy with the range of tools that is currently
> > before you?
> >
> > [BTW,  I posed the same question last week to the r-devel list, and was
> > advised that r-help might be a more suitable audience by one of the
> > moderators.]
> >
> > Robert Wilkins
> >
> >   [[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Bert Gunter
Oh crap! I mistakenly replied on-list. PLEASE IGNORE -- these are only my
ignorant opinions.

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Wed, Nov 29, 2017 at 8:48 AM, Bert Gunter  wrote:

> I don't think my view is of interest to many, so offlist.
>
> I reject this:
>
> " I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report."
>
> For example, there is no such thing as "outliers" -- data to be removed as
> part of cleaning/preparation -- without a statistical model to be an
> "outlier" **from**, which is part of the statistical analysis. And the
> structure of the data (data preparation) may need to change depending on
> the course of the analysis (including graphics, also part of the analysis).
> So I think your view reflects a naïve view of the nature of data analysis,
> which is an iterative and holistic process. I suspect your training is as a
> computer scientist and you have not done much 1-1 consulting with
> researchers, though you should certainly feel free to reject this canard.
> Building software for large-scale automated analysis of data requires a
> much different analytical paradigm than the statistical consulting model,
> which is largely my background.
>
> No reply necessary. Just my opinion, which you are of course free to trash.
>
> Cheers,
> Bert
>
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
> On Wed, Nov 29, 2017 at 8:37 AM, Robert Wilkins 
> wrote:
>
>> R has a very wide audience, clinical research, astronomy, psychology, and
>> so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This regards the process of getting the data ready for analysis and
>> reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW,  I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posti
>> ng-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Christopher W. Ryan
Great question. What do I want? I want my co-workers to stop using Excel
spreadsheets for data entry, storage, and sharing! I want them to
understand the value of data discipline. But alas . . . .

I work in a county health department in the US. Between dplyr, stringr,
grep, grepl, and the base R read() functions, I'm doing OK.

I need to learn more about APIs, so I can see if I can make R directly
grab data from, e.g. our state health department sources. My biggest
hassle is having to download a data file, save it somewhere, and then
open R and read it in. I'd like to be able to do it all in R. Would make
the generation of recurring reports easier.
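
For what it's worth, the base read() functions already accept URLs directly, 
so the download step can sometimes be skipped; a sketch with a hypothetical 
address:

dat <- read.csv("https://health.example.gov/data/cases.csv",
                stringsAsFactors = FALSE)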

--Chris Ryan

Robert Wilkins wrote:
> R has a very wide audience, clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
> 
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
> 
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
> 
> [BTW,  I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
> 
> Robert Wilkins
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Bert Gunter
I don't think my view is of interest to many, so offlist.

I reject this:

" I would consider data analysis work to be three stages: data preparation,
statistical analysis, and producing the report."

For example, there is no such thing as "outliers" -- data to be removed as
part of cleaning/preparation -- without a statistical model to be an
"outlier" **from**, which is part of the statistical analysis. And the
structure of the data (data preparation) may need to change depending on
the course of the analysis (including graphics, also part of the analysis).
So I think your view reflects a naïve view of the nature of data analysis,
which is an iterative and holistic process. I suspect your training is as a
computer scientist and you have not done much 1-1 consulting with
researchers, though you should certainly feel free to reject this canard.
Building software for large-scale automated analysis of data requires a
much different analytical paradigm than the statistical consulting model,
which is largely my background.

No reply necessary. Just my opinion, which you are of course free to trash.

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Wed, Nov 29, 2017 at 8:37 AM, Robert Wilkins 
wrote:

> R has a very wide audience, clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
>
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
>
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
>
> [BTW,  I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
>
> Robert Wilkins
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Robert Wilkins
R has a very wide audience, clinical research, astronomy, psychology, and
so on and so on.
I would consider data analysis work to be three stages: data preparation,
statistical analysis, and producing the report.
This regards the process of getting the data ready for analysis and
reporting, sometimes called "data cleaning" or "data munging" or "data
wrangling".

So as regards tools for data preparation, speaking to the highly diverse
audience mentioned, here is my question:

What do you want?
Or are you already quite happy with the range of tools that is currently
before you?

[BTW,  I posed the same question last week to the r-devel list, and was
advised that r-help might be a more suitable audience by one of the
moderators.]

Robert Wilkins

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Preventing repeated package installation, or pre installing packages

2017-11-29 Thread Thierry Onkelinx
Dear Larry,

Have a look at https://github.com/inbo/rstable. That is a Dockerfile
with a stable version of R and a set of packages.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkel...@inbo.be
Kliniekstraat 25, B-1070 Brussel
www.inbo.be

///
To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may be
able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data. ~ John Tukey
///


From 14 through 19 December 2017 we are moving from our office in
Brussels to the Herman Teirlinck building on the Thurn & Taxis site.
From then on you are welcome at the new address: Havenlaan 88 bus 73, 1000 Brussel.

///



2017-11-29 15:28 GMT+01:00 Larry Martell :
> I have a R script that I call from python using rpy2. It uses dplyr, doBy,
> and ggplot2. The script has install.packages commands for these 3 packages.
> Even though the packages are already installed it still downloads,
> builds, and installs them, which is very time consuming. Is there a way to
> have it only do the install if the package is not already installed?
>
> Also, I run in a docker container, so after the container is instantiated
> the packages are not there the first time the script runs. Is there a way
> to pre load the packages, in which case I would not need the
> install.packages commands for these packages and my above question would
> become moot.
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Preventing repeated package installation, or pre installing packages

2017-11-29 Thread Rainer Krug


> On 29 Nov 2017, at 15:28, Larry Martell  wrote:
> 
> I have a R script that I call from python using rpy2. It uses dplyr, doBy,
> and ggplot2. The script has install.packages commands for these 3 packages.
> Even though the packages are already installed it still downloads,
> builds, and installs them, which is very time consuming. Is there a way to
> have it only do the install if the package is not already installed?

You could use something like


if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

where require() returns FALSE if it fails to load the package.


> 
> Also, I run in a docker container, so after the container is instantiated
> the packages are not there the first time the script runs. Is there a way
> to pre load the packages, in which case I would not need the
> install.packages commands for these packages and my above question would
> become moot.

Yes - add them to your Dockerfile, but this is a Docker question, not an R one. 
Check out the Rocker Dockerfiles to see how you can do this.

Rainer

> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, 
UCT), Dipl. Phys. (Germany)

University of Zürich

Cell:   +41 (0)78 630 66 57
email:  rai...@krugs.de
Skype:  RMkrug

PGP: 0x0F52F982





__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Preventing repeated package installation, or pre installing packages

2017-11-29 Thread Michael Dewey

Dear Larry

As far as your first question is concerned, I think one of require() or 
requireNamespace() may be what you need.
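For example, a minimal sketch of each (using dplyr as a stand-in for any of
your packages):

# require() attaches the package and returns FALSE (with a warning) if unavailable
if (!require(dplyr)) install.packages("dplyr")

# requireNamespace() only checks that the package can be loaded, without attaching it
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")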


Michael

On 29/11/2017 14:28, Larry Martell wrote:

I have an R script that I call from python using rpy2. It uses dplyr, doBy,
and ggplot2. The script has install.packages commands for these 3 packages.
Even though the packages are already installed, it still downloads,
builds, and installs them, which is very time consuming. Is there a way to
have it only do the install if the package is not already installed?

Also, I run in a docker container, so after the container is instantiated
the packages are not there the first time the script runs. Is there a way
to preload the packages, in which case I would not need the
install.packages commands for these packages and my above question would
become moot.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Michael
http://www.dewey.myzen.co.uk/home.html

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Preventing repeated package installation, or pre installing packages

2017-11-29 Thread Larry Martell
I have an R script that I call from python using rpy2. It uses dplyr, doBy,
and ggplot2. The script has install.packages commands for these 3 packages.
Even though the packages are already installed, it still downloads,
builds, and installs them, which is very time consuming. Is there a way to
have it only do the install if the package is not already installed?

Also, I run in a docker container, so after the container is instantiated
the packages are not there the first time the script runs. Is there a way
to preload the packages, in which case I would not need the
install.packages commands for these packages and my above question would
become moot.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] DeSolve Package and Moving Average

2017-11-29 Thread Eric Berger
Since you only provide pseudo-code, I will give a guess as to the source of
the problem.
It is easy to get "burned" by the ifelse() function: its result has the
same "shape" (in particular, the same length) as its first argument.
My suggestion is to try replacing ifelse() with a standard

if (  ) {
} else {
}
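For example, a quick demonstration of the shape issue (toy values, not from
your model):

x <- c(10, 20, 30)
ifelse(FALSE, 0, x)  # returns 10: the result is shaped like the length-1 test
if (FALSE) 0 else x  # a plain if/else returns the full vector c(10, 20, 30)

Applied to your pattern (assuming movavg() comes from the pracma package),
that would be something like:

aMovingAverage <- if (!exists("ResultsSimulation")) 1 else
  movavg(ResultsSimulation$TotalSales, 12, type = "s")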

HTH,
Eric



On Wed, Nov 29, 2017 at 1:29 PM, Werning, Jan-Philipp <
jan-philipp.wern...@whu.edu> wrote:

> Dear all,
>
>
> I am using the deSolve package to simulate a system dynamics model. At the
> problematic point in the model, I basically want to decide how many
> products should be produced for sale. To determine that amount, a basic
> forecasting model using the average of the last 12 time periods is to be
> used. My code looks like the following.
>
> “ […]
>
> # Time units in month
> START<-0; FINISH<-120; STEP<-1
>
> # Set seed for reproducibility
>
>  set.seed(123)
>
> # Create time vector
> simtime  <- seq(START, FINISH, by=STEP)
>
> # Create a stock vector with initial values
> stocks   <- c([…])
>
> # Create an aux vector for the fixed aux values
> auxs<- c([…])
>
>
> model <- function(time, stocks, auxs){
>   with(as.list(c(stocks, auxs)),{
>
> [… “lots of aux, flow, and stock functions” … ]
>
>
> aMovingAverage  <-  ifelse(exists("ResultsSimulation") == "FALSE",
> 1, movavg(ResultsSimulation$TotalSales, 12, type = "s"))
>
>
> return(list(c([…])))
>
>   })
> }
>
> # Call Solver, and store results in a data frame
> ResultsSimulation <-  data.frame(ode(y=stocks, times=simtime, func = model,
>   parms=auxs, method="euler"))
>
> […]”
>
> My problem is that the moving average (function: movavg) is only computed
> once, and the same value is used in every timestep of the model. I.e., when
> running the model for the first time, 1 is used; when running it the
> next time, the total sales value of the first timestep is used. Since only
> one timestep exists at that point, this is logical. Yet I would expect the
> movavg function to produce a new value in each of the 120 timesteps, as is
> the case with all other flow, stock and aux calculations as well.
>
> It would be great if you could help me with fixing this problem.
>
>
> Many thanks in advance!
>
> Yours,
>
> Jan
>
>
>
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] DeSolve Package and Moving Average

2017-11-29 Thread Werning, Jan-Philipp
Dear all,


I am using the deSolve package to simulate a system dynamics model. At the 
problematic point in the model, I basically want to decide how many products 
should be produced for sale. To determine that amount, a basic forecasting 
model using the average of the last 12 time periods is to be used. My code 
looks like the following.

“ […]

# Time units in month
START<-0; FINISH<-120; STEP<-1

# Set seed for reproducibility

 set.seed(123)

# Create time vector
simtime  <- seq(START, FINISH, by=STEP)

# Create a stock vector with initial values
stocks   <- c([…])

# Create an aux vector for the fixed aux values
auxs<- c([…])


model <- function(time, stocks, auxs){
  with(as.list(c(stocks, auxs)),{

[… “lots of aux, flow, and stock functions” … ]


aMovingAverage  <-  ifelse(exists("ResultsSimulation") == "FALSE",
  1, movavg(ResultsSimulation$TotalSales, 12, type = "s"))


return(list(c([…])))

  })
}

# Call Solver, and store results in a data frame
ResultsSimulation <-  data.frame(ode(y=stocks, times=simtime, func = model,
  parms=auxs, method="euler"))

[…]”

My problem is that the moving average (function: movavg) is only computed once, 
and the same value is used in every timestep of the model. I.e., when running 
the model for the first time, 1 is used; when running it the next time, the 
total sales value of the first timestep is used. Since only one timestep 
exists at that point, this is logical. Yet I would expect the movavg function 
to produce a new value in each of the 120 timesteps, as is the case with all 
other flow, stock and aux calculations as well.

It would be great if you could help me with fixing this problem.


Many thanks in advance!

Yours,

Jan





[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R-es] Word search in an R variable

2017-11-29 Thread Carlos J. Gil Bellosta
readLines()
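
For instance, a minimal sketch (the file name and column names are
placeholders, not from your messages):

# one observation per line
lineas <- readLines("archivo.txt")
datos <- data.frame(texto = lineas, stringsAsFactors = FALSE)

# for the original question: flag rows containing any of the 40 words in
# 'marca' (assumes the words contain no regex metacharacters)
patron <- paste(marca, collapse = "|")
datos$dummy <- as.integer(grepl(patron, datos$texto, ignore.case = TRUE))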

On Wed, 29 Nov 2017 at 5:51,  wrote:

> Many thanks,
>
> I am trying to run the package and I need to import the txt file, but I
> need to import it in such a way that each line is one observation rather
> than a single text (I have about 63,000 lines). I cannot find the
> solution in the links. Would you know how to do it?
>
> Thanks!
> On Tue, 28 November 2017 at 3:50, Freddy Omar López Quintero wrote:
> > On Tue, 28-11-2017 at 03:42 +0100, miriam.alz...@unavarra.es
> > wrote:
> >> I have a vector of 40 words (marca) and I need to know whether one of
> >> the variables of the data.frame (datos) includes any of those 40
> >> words. If it includes any of them, I would like to create a dummy
> >> variable, with 1 meaning it includes some word and 0 meaning it
> >> does not.
> >>
> >> Which package would you recommend? What would be the command to run?
> >
> > What you describe looks like text mining, and what you seem to want
> > is a slice of the matrix known as the Term-Document Matrix. The
> > package par excellence for these tasks is tm:
> >
> > https://cran.r-project.org/web/packages/tm/
> >
> > which has a good vignette:
> >
> > https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
> >
> > Hope it helps.
> >
> > Regards.
> >
> >
> > --
> > «...homines autem hominum causa esse generatos...»
> >
> > Cicero
>
> ___
> R-help-es mailing list
> R-help-es@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-help-es
>

[[alternative HTML version deleted]]

___
R-help-es mailing list
R-help-es@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-help-es


Re: [R] How to extract coefficients from sequential (type 1), ANOVAs using lmer and lme

2017-11-29 Thread Phillip Alday
(This time with the r-help list in the recipients...)

Be careful when mixing lme4 and lmerTest together -- lmerTest extends
and changes the behavior of various lme4 functions.

From the help page for lme4-anova (?lme4::anova.merMod)

>  ‘anova’: returns the sequential decomposition of the contributions
>   of fixed-effects terms or, for multiple arguments, model
>   comparison statistics.  For objects of class ‘lmerMod’ the
>   default behavior is to refit the models with ML if fitted
>   with ‘REML = TRUE’, this can be controlled via the ‘refit’
>   argument. See also ‘anova’.

So lme4-anova will give you sequential tests; note, however, that lme4
won't calculate the denominator degrees of freedom for you and thus
won't give p-values. See the FAQ
(https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-are-p_002dvalues-not-displayed-when-using-lmer_0028_0029_003f)

From the help page for lmerTest-anova (?lmerTest::anova.merModLmerTest):
> Usage:
> 
>  ## S4 method for signature 'merModLmerTest'
>  anova(object, ... , ddf="Satterthwaite", 
>  type=3)
>  
> Arguments:
> 
...
> type: type of hypothesis to be tested. Could be type=3 or type=2 or
>   type = 1 (The definition comes from SAS theory)


So lmerTest-anova by default gives you Type III ('marginal') tests,
although Type II is what actually gives you tests that respect the
Principle of Marginality; see John Fox's Applied Regression Analysis
(book) or Venables' "Exegeses on Linear Models"
(https://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf) for more information
on that. Type I tests are the sequential tests, so with anova(model,
type=1), you will get the sequential tests you want. lmerTest will
approximate the denominator degrees of freedom for you (using the
Satterthwaite method by default, or the more computationally intensive
Kenward-Roger method), so you'll get p-values if that's what you want.
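
For example, a minimal sketch (the formula, data and variable names are
placeholders, not from your post):

library(lmerTest)   # load after lme4; it extends lmer() and anova()
m <- lmer(y ~ x1 + x2 + (1 | group), data = dat)
anova(m, type = 1)  # sequential (Type I) tests, Satterthwaite ddf by default
summary(m)          # coefficient table; interpretation depends on contrasts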

Finally, it's important to note two things:

1. The "type"-argument for nlme::summary doesn't actually do anything
(see ?nlme::summary.lme). It's just passed on to the 'print' method,
where it's silently ignored. The 'type' of sum of squares is an
ANOVA-thing; the closest correspondence in terms of model coefficients
is the coding of your categorical contrasts. See the literature
mentioned above for more details as well as Dale Barr's discussion on
simple vs. main effects in regression models
(http://talklab.psy.gla.ac.uk/tvw/catpred/).

(?nlme::anova.lme does indeed have a 'type' argument.)

2. It is possible for the sequential tests and the marginal tests to
yield the same results. Again, see the above literature. You have no
interactions in your model and continuous (i.e., non-categorical)
predictors, so if they're orthogonal, then the sequential and marginal
tests will be numerically the same, even if they test different
hypotheses. (See section 5.2, starting on page 14; the sequential tests
are the "eliminating" tests, while the marginal tests are the "ignoring"
tests in that explanation.)
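
A toy illustration of that point (hypothetical, balanced data; lm() used for
simplicity, but the same logic carries over):

set.seed(1)
d <- data.frame(x1 = rep(c(-1, 1), 50), x2 = rep(c(-1, 1), each = 50))
d$y <- 0.5 * d$x1 + rnorm(100)
m <- lm(y ~ x1 + x2, data = d)
anova(m)              # sequential (Type I) tests
drop1(m, test = "F")  # marginal tests; identical F values here because
                      # x1 and x2 are orthogonal by construction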

Best,
Phillip


On 28/11/17 12:00, r-help-requ...@r-project.org wrote:
> I want to run sequential ANOVAs (i.e. type I sums of squares), and am trying
> to get results including ANOVA tables and associated coefficients for
> predictive variables (I am using the R 3.4.2 version). I think the ANOVA
> tables look right, but believe the coefficients are wrong. Specifically, it
> looks like the coefficients are from an ANOVA with 'marginal' (type III)
> sums of squares. I have tried both lme (nlme package) and lmer (lme4 +
> lmerTest packages). Examples of the results are below:
> 





> I believe the results from summary() are for 'marginal' instead of
> 'sequential' ANOVA because the p-value (i.e., 0.237 for narea) in summary is
> identical to those in tables from 'marginal'. I also used lmer in the lme4
> package to find the same results (summary() results look like they are from 'marginal').
> 
> 
> Can anybody tell me how to get coefficients for 'sequential' ANOVAs? Thank you.
>

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.