Re: [Rd] Bug: time complexity of substring is quadratic

2019-02-23 Thread Radford Neal
> From: Tomas Kalibera 
> 
> Thanks for the report, I am working on a patch that will address this.
> 
> I confirm there is a lot of potential for speedup. On my system,
> 
> 'N=20; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'
> 
> spends 96% time in checking if the string is ascii and 3% in strlen(); 
> if we take advantage of the pre-computed value in the ASCII bit, the 
> speed up is about 40x.


The latest version of pqR (at pqR-project.org) has changes that
considerably speed up both this and other string operations.

Here's a test (both compiled with gcc 8.2.0 with -O3 on a Skylake processor).

R-3.5.2:

> N=20; system.time(for (i in 1:100) r<-paste(rep("A",N),collapse=""))
   user  system elapsed 
  1.548   0.023   1.572 
> system.time(for (i in 1:10) x<-substring(r,1:N,1:N))
   user  system elapsed 
  4.462   0.016   4.478 

pqR-2019-02-19:

> N=20; system.time(for (i in 1:100) r<-paste(rep("A",N),collapse=""))
   user  system elapsed 
  0.318   0.071   0.388 
> system.time(for (i in 1:10) x<-substring(r,1:N,1:N))
   user  system elapsed 
  0.041   0.000   0.041 

Some of this may be due to pqR's faster garbage collector - R Core
implementatons have a particular GC problem with strings, as explained at
https://radfordneal.wordpress.com/2018/11/29/faster-garbage-collection-in-pqr/

But there are also some specific improvements to string operations that
you might want to have a look at.

Radford Neal

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] model.matrix.default() silently ignores bad contrasts.arg

2019-02-23 Thread Ben Bolker


 thanks!

On 2019-02-23 5:42 a.m., Martin Maechler wrote:
>> Fox, John 
>> on Fri, 22 Feb 2019 17:40:15 + writes:
> 
> > Dear Martin and Ben, I agree that a warning is a good idea
> > (and perhaps that wasn't clear in my response to Ben's
> > post).
> 
> > Also, it would be nice to correct the omission in the help
> > file, which as far as I could see doesn't mention that a
> > contrast-generating function (as opposed to its quoted
> > name) can be an element of the contrasts.arg list.
> 
> > Best, John
> 
> Thank you John for the clarification and the reminder about
> filling the omission there!
> 
> Prepared to go (into the sources) now.
> Martin
> 
> 
> >> -Original Message- From: Martin Maechler
> >> [mailto:maech...@stat.math.ethz.ch] Sent: Friday,
> >> February 22, 2019 11:50 AM To: Ben Bolker
> >>  Cc: Fox, John ;
> >> r-devel@r-project.org Subject: Re: [Rd]
> >> model.matrix.default() silently ignores bad contrasts.arg
> >> 
> >> > Ben Bolker > on Thu, 21 Feb 2019 08:18:51 -0500
> >> writes:
> >> 
> >> > On Thu, Feb 21, 2019 at 7:49 AM Fox, John
> >>  wrote:
> >> >>
> >> >> Dear Ben,
> >> >>
> >> >> Perhaps I'm missing the point, but contrasts.arg is
> >> documented to be a list. From ?model.matrix:
> >> "contrasts.arg: A list, whose entries are values (numeric
> >> matrices or character strings naming functions) to be
> >> used as replacement values for the contrasts replacement
> >> function and whose names are the names of columns of data
> >> containing factors."
> >> 
> >> > I absolutely agree that this is not a bug/behaves as
> >> documented (I > could have said that more clearly).  It's
> >> just that (for reasons I > attempted to explain) this is
> >> a really easy mistake to make.
> >> 
> >> >> This isn't entirely accurate because a function also
> >> works as a named element of the list (in addition to a
> >> character string naming a function and a contrast
> >> matrix), as your example demonstrates, but nowhere that
> >> I'm aware of is it suggested that a non-list should work.
> >> >>
> >> >> It certainly would be an improvement if specifying
> >> contrast.arg as a non- list generated an error or warning
> >> message, and it at least arguably would be convenient to
> >> allow a general contrast specification such as
> >> contrasts.arg- "contr.sum", but I don't see a bug here.
> >> 
> >> > I agree.  That's what my patch does (throws a warning
> >> message if > contrasts.arg is non-NULL and not a list).
> >> 
> >> I currently do think this is a good idea... "even though"
> >> I'm 99% sure that this will make work for package
> >> maintainers and others whose code may suddenly show
> >> warnings.  I hope they would know better than
> >> suppressWarnings(.) ...
> >> 
> >> I see a version of the patch using old style indentation
> >> which makes the diff even "considerably" smaller -- no
> >> need to submit this different, though -- and I plan to
> >> test that a bit, and commit eventually to R-devel,
> >> possibly in a 5 days or so.
> >> 
> >> Thank you Ben for the suggestion and patch !  Martin
> >> 
> >> > cheers > Ben Bolker
> >> 
> >> >> Best, >> John
> >> >>
> >> >> -
> >> >> John Fox, Professor Emeritus >> McMaster University >>
> >> Hamilton, Ontario, Canada >> Web:
> >> http::/socserv.mcmaster.ca/jfox
> >> >>
> >> >> > On Feb 20, 2019, at 7:14 PM, Ben Bolker
> >>  wrote:
> >> >> >
> >> >> > An lme4 user pointed out
> >>  that >> >
> >> passing contrasts as a string or symbol to [g]lmer (which
> >> would work if >> > we were using `contrasts<-` to set
> >> contrasts on a factor variable) is >> > *silently
> >> ignored*. This goes back to model.matrix(), and seems bad
> >> >> > (this is a very easy mistake to make, because of the
> >> multitude of ways >> > to specify contrasts for factors
> >> in R - e.g. options(contrasts=...); >> > setting
> >> contrasts on the specific factors; passing contrasts as a
> >> list >> > to the model function ... )
> >> >> >
> >> >> > The relevant code is here:
> >> >> >
> >> >> > https://github.com/wch/r-
> >> source/blob/trunk/src/library/stats/R/models.R#L578-L603
> >> >> >
> >> >> > The following code shows the problem: a
> >> plain-vanilla model.matrix() >> > call with no contrasts
> >> argument, followed by two wrong contrasts >> > arguments,
> >> followed by a correct contrasts argument.
> >> >> >
> >> >> > data(cbpp, package="lme4") >> > mf1 <-
> >> model.matrix(~period, data=cbpp) >> > mf2 <-
> >> model.matrix(~period, con

Re: [Rd] Return/print standard error in t.test()

2019-02-23 Thread Thomas J. Leeper
That seems great to me. Thank you very much!

-Thomas

On Sat, Feb 23, 2019 at 11:14 AM Martin Maechler
 wrote:
>
> > peter dalgaard
> > on Fri, 22 Feb 2019 12:38:14 +0100 writes:
>
> > It's not a problem per se to put additional information
> > into class htest objects (hey, it's S3 after all...) and
> > there is a precedent in chisq.test which returns $observed
> > and $expected.
>
> It seems the consent is to simply return the SE but *not* change
> the print() method, and also be careful not to mess with
> existing parts of the result.
> So, a minimal patch is to add the short line
>
>   stderr = stderr,
>
> inside the list(..) constucting the return value...
>
> and that's what I'm planning to commit (to the sources).
>
> With thanks for the suggestion and considerations to
> Thomas, John and Peter!
>
> Martin
>
> > Getting such information printed by print.htest is more tricky, 
> although it might be possible to (ab)use the $estimate slot.
>
> > The further question is whether one would really want to do that 
> (change the output and/or modify the current return values), at the risk of 
> affecting a rather large bundle of existing scripts, books, lecture notes, 
> etc. I don't think that I would want to do that for the case of the s.e.d., 
> although I'll admit that there is another thing that has always been a bit of 
> an eyesore to me: We give a confidence interval but not the corresponding 
> point estimate (i.e. the _difference_ of the means).
>
> > It might be better to simply start over and write a new function. In 
> the process one might address other things that people have been asking for, 
> like calculations based on the sample mean and SDs (which would useful for 
> dealing with published summaries and textbook examples). Oh, and a formula 
> interface for the one-sample test.
>
> > -pd
>
> >> On 21 Feb 2019, at 22:51 , Fox, John  wrote:
> >>
> >> Dear Thomas,
> >>
> >> it is, unfortunately, not that simple. t.test() returns an object of 
> class "htest" and not all such objects have standard errors. I'm not entirely 
> sure what the point is since it's easy to compute the standard error of the 
> difference from the information in the object (adapting an example from 
> ?t.test):
> >>
> >>> (res <- t.test(1:10, y = c(7:20)))
> >>
> >> Welch Two Sample t-test
> >>
> >> data:  1:10 and c(7:20)
> >> t = -5.4349, df = 21.982, p-value = 1.855e-05
> >> alternative hypothesis: true difference in means is not equal to 0
> >> 95 percent confidence interval:
> >> -11.052802  -4.947198
> >> sample estimates:
> >> mean of x mean of y
> >> 5.5  13.5
> >>
> >>> as.vector(abs(diff(res$estimate)/res$statistic)) # SE
> >> [1] 1.47196
> >>> class(res)
> >> [1] "htest"
> >>
> >> and if you really want to print the SE as a matter of course, you 
> could always write your own wrapper for t.test() that returns an object of 
> class, say, "t.test" for which you can provide a print() method. Much of the 
> advantage of working in a statistical computing environment like R (or Stata, 
> for that matter) is that you can make things work the way you like.
> >>
> >> Best,
> >> John
> >>
> >> -
> >> John Fox, Professor Emeritus
> >> McMaster University
> >> Hamilton, Ontario, Canada
> >> Web: http::/socserv.mcmaster.ca/jfox
> >>
> >>> On Feb 21, 2019, at 3:57 PM, Thomas J. Leeper  
> wrote:
> >>>
> >>> A recent thread on Twitter [1] by a Stata user highlighted that 
> t.test()
> >>> does not return or print the standard error of the mean difference, 
> despite
> >>> it being calculated by the function.
> >>>
> >>> I know this isn’t the kind of change that’s likely to be made but 
> could we
> >>> at least return the SE even if the print() method isn’t updated? Or,
> >>> better, update the print() method to display this as well?
> >>>
> >>> Best,
> >>> Thomas
> >>>
> >>> [1]
> >>> https://twitter.com/amandayagan/status/1098314654470819840?s=21
> >>> --
> >>>
> >>> Thomas J. Leeper
> >>> http://www.thomasleeper.com
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> __
> >>> R-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >> __
> >> R-devel@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: pd@cbs.dk  Priv: pda...@gmail.com
>
> > __

Re: [Rd] Return/print standard error in t.test()

2019-02-23 Thread Martin Maechler
> peter dalgaard 
> on Fri, 22 Feb 2019 12:38:14 +0100 writes:

> It's not a problem per se to put additional information
> into class htest objects (hey, it's S3 after all...) and
> there is a precedent in chisq.test which returns $observed
> and $expected.

It seems the consent is to simply return the SE but *not* change
the print() method, and also be careful not to mess with
existing parts of the result.
So, a minimal patch is to add the short line

  stderr = stderr,

inside the list(..) constucting the return value...

and that's what I'm planning to commit (to the sources).

With thanks for the suggestion and considerations to
Thomas, John and Peter!

Martin

> Getting such information printed by print.htest is more tricky, although 
it might be possible to (ab)use the $estimate slot. 

> The further question is whether one would really want to do that (change 
the output and/or modify the current return values), at the risk of affecting a 
rather large bundle of existing scripts, books, lecture notes, etc. I don't 
think that I would want to do that for the case of the s.e.d., although I'll 
admit that there is another thing that has always been a bit of an eyesore to 
me: We give a confidence interval but not the corresponding point estimate 
(i.e. the _difference_ of the means).

> It might be better to simply start over and write a new function. In the 
process one might address other things that people have been asking for, like 
calculations based on the sample mean and SDs (which would useful for dealing 
with published summaries and textbook examples). Oh, and a formula interface 
for the one-sample test.

> -pd

>> On 21 Feb 2019, at 22:51 , Fox, John  wrote:
>> 
>> Dear Thomas,
>> 
>> it is, unfortunately, not that simple. t.test() returns an object of 
class "htest" and not all such objects have standard errors. I'm not entirely 
sure what the point is since it's easy to compute the standard error of the 
difference from the information in the object (adapting an example from 
?t.test):
>> 
>>> (res <- t.test(1:10, y = c(7:20)))
>> 
>> Welch Two Sample t-test
>> 
>> data:  1:10 and c(7:20)
>> t = -5.4349, df = 21.982, p-value = 1.855e-05
>> alternative hypothesis: true difference in means is not equal to 0
>> 95 percent confidence interval:
>> -11.052802  -4.947198
>> sample estimates:
>> mean of x mean of y 
>> 5.5  13.5 
>> 
>>> as.vector(abs(diff(res$estimate)/res$statistic)) # SE
>> [1] 1.47196
>>> class(res)
>> [1] "htest"
>> 
>> and if you really want to print the SE as a matter of course, you could 
always write your own wrapper for t.test() that returns an object of class, 
say, "t.test" for which you can provide a print() method. Much of the advantage 
of working in a statistical computing environment like R (or Stata, for that 
matter) is that you can make things work the way you like.
>> 
>> Best,
>> John
>> 
>> -
>> John Fox, Professor Emeritus
>> McMaster University
>> Hamilton, Ontario, Canada
>> Web: http::/socserv.mcmaster.ca/jfox
>> 
>>> On Feb 21, 2019, at 3:57 PM, Thomas J. Leeper  
wrote:
>>> 
>>> A recent thread on Twitter [1] by a Stata user highlighted that t.test()
>>> does not return or print the standard error of the mean difference, 
despite
>>> it being calculated by the function.
>>> 
>>> I know this isn’t the kind of change that’s likely to be made but could 
we
>>> at least return the SE even if the print() method isn’t updated? Or,
>>> better, update the print() method to display this as well?
>>> 
>>> Best,
>>> Thomas
>>> 
>>> [1]
>>> https://twitter.com/amandayagan/status/1098314654470819840?s=21
>>> -- 
>>> 
>>> Thomas J. Leeper
>>> http://www.thomasleeper.com
>>> 
>>> [[alternative HTML version deleted]]
>>> 
>>> __
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

> -- 
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd@cbs.dk  Priv: pda...@gmail.com

> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] model.matrix.default() silently ignores bad contrasts.arg

2019-02-23 Thread Martin Maechler
> Fox, John 
> on Fri, 22 Feb 2019 17:40:15 + writes:

> Dear Martin and Ben, I agree that a warning is a good idea
> (and perhaps that wasn't clear in my response to Ben's
> post).

> Also, it would be nice to correct the omission in the help
> file, which as far as I could see doesn't mention that a
> contrast-generating function (as opposed to its quoted
> name) can be an element of the contrasts.arg list.

> Best, John

Thank you John for the clarification and the reminder about
filling the omission there!

Prepared to go (into the sources) now.
Martin


>> -Original Message- From: Martin Maechler
>> [mailto:maech...@stat.math.ethz.ch] Sent: Friday,
>> February 22, 2019 11:50 AM To: Ben Bolker
>>  Cc: Fox, John ;
>> r-devel@r-project.org Subject: Re: [Rd]
>> model.matrix.default() silently ignores bad contrasts.arg
>> 
>> > Ben Bolker > on Thu, 21 Feb 2019 08:18:51 -0500
>> writes:
>> 
>> > On Thu, Feb 21, 2019 at 7:49 AM Fox, John
>>  wrote:
>> >>
>> >> Dear Ben,
>> >>
>> >> Perhaps I'm missing the point, but contrasts.arg is
>> documented to be a list. From ?model.matrix:
>> "contrasts.arg: A list, whose entries are values (numeric
>> matrices or character strings naming functions) to be
>> used as replacement values for the contrasts replacement
>> function and whose names are the names of columns of data
>> containing factors."
>> 
>> > I absolutely agree that this is not a bug/behaves as
>> documented (I > could have said that more clearly).  It's
>> just that (for reasons I > attempted to explain) this is
>> a really easy mistake to make.
>> 
>> >> This isn't entirely accurate because a function also
>> works as a named element of the list (in addition to a
>> character string naming a function and a contrast
>> matrix), as your example demonstrates, but nowhere that
>> I'm aware of is it suggested that a non-list should work.
>> >>
>> >> It certainly would be an improvement if specifying
>> contrast.arg as a non- list generated an error or warning
>> message, and it at least arguably would be convenient to
>> allow a general contrast specification such as
>> contrasts.arg- "contr.sum", but I don't see a bug here.
>> 
>> > I agree.  That's what my patch does (throws a warning
>> message if > contrasts.arg is non-NULL and not a list).
>> 
>> I currently do think this is a good idea... "even though"
>> I'm 99% sure that this will make work for package
>> maintainers and others whose code may suddenly show
>> warnings.  I hope they would know better than
>> suppressWarnings(.) ...
>> 
>> I see a version of the patch using old style indentation
>> which makes the diff even "considerably" smaller -- no
>> need to submit this different, though -- and I plan to
>> test that a bit, and commit eventually to R-devel,
>> possibly in a 5 days or so.
>> 
>> Thank you Ben for the suggestion and patch !  Martin
>> 
>> > cheers > Ben Bolker
>> 
>> >> Best, >> John
>> >>
>> >> -
>> >> John Fox, Professor Emeritus >> McMaster University >>
>> Hamilton, Ontario, Canada >> Web:
>> http::/socserv.mcmaster.ca/jfox
>> >>
>> >> > On Feb 20, 2019, at 7:14 PM, Ben Bolker
>>  wrote:
>> >> >
>> >> > An lme4 user pointed out
>>  that >> >
>> passing contrasts as a string or symbol to [g]lmer (which
>> would work if >> > we were using `contrasts<-` to set
>> contrasts on a factor variable) is >> > *silently
>> ignored*. This goes back to model.matrix(), and seems bad
>> >> > (this is a very easy mistake to make, because of the
>> multitude of ways >> > to specify contrasts for factors
>> in R - e.g. options(contrasts=...); >> > setting
>> contrasts on the specific factors; passing contrasts as a
>> list >> > to the model function ... )
>> >> >
>> >> > The relevant code is here:
>> >> >
>> >> > https://github.com/wch/r-
>> source/blob/trunk/src/library/stats/R/models.R#L578-L603
>> >> >
>> >> > The following code shows the problem: a
>> plain-vanilla model.matrix() >> > call with no contrasts
>> argument, followed by two wrong contrasts >> > arguments,
>> followed by a correct contrasts argument.
>> >> >
>> >> > data(cbpp, package="lme4") >> > mf1 <-
>> model.matrix(~period, data=cbpp) >> > mf2 <-
>> model.matrix(~period, contrasts.arg="contr.sum",
>> data=cbpp) >> > all.equal(mf1,mf2) ## TRUE >> > mf3 <-
>> model.matrix(~period, contrasts.arg=contr.sum, data=cbpp)
>> >> > all.equal(mf1,mf3) ## TRUE >> > mf4 <-
>> model.matrix(~period,
>> contrasts.arg=list(period=contr.sum), >> > data=cb