Re: [Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

2017-06-07 Thread Hervé Pagès

Hi Martin,

On 06/07/2017 03:54 AM, Martin Maechler wrote:

Martin Maechler 
 on Tue, 6 Jun 2017 09:45:44 +0200 writes:



Hervé Pagès 
 on Fri, 2 Jun 2017 04:05:15 -0700 writes:


 >> Hi, I have a long numeric vector 'xx' and I want to use
 >> sum() to count the number of elements that satisfy some
 >> criteria like non-zero values or values lower than a
 >> certain threshold etc...

 >> The problem is: sum() returns an NA (with a warning) if
 >> the count is greater than 2^31. For example:

 >>   > xx <- runif(3e9)
 >>   > sum(xx < 0.9)
 >>   [1] NA
 >>   Warning message:
 >>   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))

 >> This already takes a long time and doing
 >> sum(as.numeric(.)) would take even longer and require
 >> allocation of 24Gb of memory just to store an
 >> intermediate numeric vector made of 0s and 1s. Plus,
 >> having to do sum(as.numeric(.)) every time I need to
 >> count things is not convenient and is easy to forget.

 >> It seems that sum() on a logical vector could be modified
 >> to return the count as a double when it cannot be
 >> represented as an integer.  Note that length() already
 >> does this so that wouldn't create a precedent. Also and
 >> FWIW prod() avoids the problem by always returning a
 >> double, whatever the type of the input is (except on a
 >> complex vector).

 >> I can provide a patch if this change sounds reasonable.

 > This sounds very reasonable, thank you Hervé, for the
 > report, and even more for a (small) patch.

I was made aware of the fact that R treats logical and
integer very often identically in the C code, and in general we
even mention that logicals are treated as 0/1/NA integers in
arithmetic.

For the present case that would mean that we should also
safe-guard against *integer* overflow in sum(.)  and that is
not something we have done / wanted to do in the past...  Speed
being one reason.

So this ends up being more delicate than I had thought at first,
because changing  sum()  only would mean that

   sum(LOGI)  and
   sum(as.integer(LOGI))

would start to differ for a logical vector LOGI.

So, for now this is something that must be approached carefully,
and the R Core team may want to discuss "in private" first.

I'm sorry for having raised possibly unrealistic expectations.


No worries. Thanks for taking my proposal into consideration.
Note that the isum() function in src/main/summary.c is already using
a 64-bit accumulator to accommodate intermediate sums > INT_MAX.
So it should be easy to modify the function so that it overflows only for
much bigger final sums, without altering performance. Seems like
R_XLEN_T_MAX would be the natural threshold.
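
As a rough illustration of that idea (a minimal sketch, not the actual
isum() code): accumulate in 64 bits, report the count as an int when it
fits, and fall back to a double up to a larger bound. The names
count_true(), finalize_count(), count_result and SKETCH_COUNT_MAX are
hypothetical, the value 2^52 is only an assumed stand-in for R_XLEN_T_MAX,
and NA handling is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for R_XLEN_T_MAX; 2^52 is an assumption of this
   sketch. Any count up to this bound is exactly representable as a double. */
#define SKETCH_COUNT_MAX ((int64_t) 1 << 52)

/* Hypothetical result type, not part of R's C API. */
typedef enum { COUNT_INT, COUNT_DOUBLE, COUNT_OVERFLOW } count_kind;

typedef struct {
    count_kind kind;
    int as_int;        /* valid when kind == COUNT_INT */
    double as_double;  /* valid when kind == COUNT_DOUBLE */
} count_result;

/* Decide how to report a non-negative 64-bit total: as an int when it
   fits, as a double up to SKETCH_COUNT_MAX, and as overflow beyond that. */
static count_result finalize_count(int64_t total)
{
    assert(total >= 0);
    count_result r = { COUNT_OVERFLOW, 0, 0.0 };
    if (total <= INT_MAX) {
        r.kind = COUNT_INT;
        r.as_int = (int) total;
    } else if (total <= SKETCH_COUNT_MAX) {
        r.kind = COUNT_DOUBLE;
        r.as_double = (double) total;
    }
    return r;
}

/* Count the nonzero entries of a 0/1 vector using a 64-bit accumulator.
   NA handling is deliberately left out of this sketch. */
static count_result count_true(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (x[i] != 0);
    return finalize_count(total);
}
```

The loop body is the same as with an int result; only the final
conversion differs, which is why the change should not cost anything in
the hot path.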

Cheers,
H.



Martin

 > Martin

 >> Cheers, H.

 >> --
 >> Hervé Pagès

 >> Program in Computational Biology Division of Public
 >> Health Sciences Fred Hutchinson Cancer Research Center
 >> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
 >> 98109-1024

 >> E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:
 >> (206) 667-1319

 >> __
 >> R-devel@r-project.org mailing list
 >> https://stat.ethz.ch/mailman/listinfo/r-devel

 > __
 > R-devel@r-project.org mailing list
 > https://stat.ethz.ch/mailman/listinfo/r-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Bioc-devel] How to re-check the submitted package?

2017-06-07 Thread Shepherd, Lori
What package are you referring to?


If you updated the DESCRIPTION file and a new build did not occur, make sure 
you have set up the webhook:

https://github.com/Bioconductor/Contributions#adding-a-web-hook




Lori Shepherd

Bioconductor Core Team

Roswell Park Cancer Institute

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263


From: Bioc-devel  on behalf of Hu, Zicheng 

Sent: Wednesday, June 7, 2017 2:23:23 PM
To: bioc-devel@r-project.org
Subject: [Bioc-devel] How to re-check the submitted package?

Hi, All,

This is my first time contributing packages to Bioconductor.  After I submitted 
the package, the automated check found an error, so I updated my package on 
GitHub and updated the version number in the DESCRIPTION file as well. A week 
has passed and nothing has happened.  How do I request a new check for my 
package?

Thanks
Zicheng


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




Re: [Bioc-devel] How to re-check the submitted package?

2017-06-07 Thread Martin Morgan

On 06/07/2017 02:23 PM, Hu, Zicheng wrote:

Hi, All,

This is my first time contributing packages to Bioconductor.  After I submitted 
the package, the automated check found an error, so I updated my package on 
GitHub and updated the version number in the DESCRIPTION file as well. A week 
has passed and nothing has happened.  How do I request a new check for my 
package?


What is your package?



Thanks
Zicheng


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel






[Bioc-devel] How to re-check the submitted package?

2017-06-07 Thread Hu, Zicheng
Hi, All,

This is my first time contributing packages to Bioconductor.  After I submitted 
the package, the automated check found an error, so I updated my package on 
GitHub and updated the version number in the DESCRIPTION file as well. A week 
has passed and nothing has happened.  How do I request a new check for my 
package?

Thanks
Zicheng


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] ensembl 89 gtfs and fasta twobits have been added to AnnotationHub

2017-06-07 Thread Shepherd, Lori

Hello all,

The Ensembl 89 gtf (converted to GRanges on the fly) and fasta (twobit files) 
have been added to AnnotationHub and are currently available in Bioc3.6 
(development) with AnnotationHub (<= 2.9.0):



> library(AnnotationHub)
> hub = AnnotationHub()
updating metadata: retrieving 1 resource
  |==| 100%
snapshotDate(): 2017-06-07
> length(query(hub, c("ensembl", "gtf", "release-89")))
[1] 215

> length(query(hub, c("fasta", "release-89", "twobit")))
[1] 353

Thank you


Lori Shepherd

Bioconductor Core Team

Roswell Park Cancer Institute

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

2017-06-07 Thread Martin Maechler
> Martin Maechler 
> on Tue, 6 Jun 2017 09:45:44 +0200 writes:

> Hervé Pagès 
> on Fri, 2 Jun 2017 04:05:15 -0700 writes:

>> Hi, I have a long numeric vector 'xx' and I want to use
>> sum() to count the number of elements that satisfy some
>> criteria like non-zero values or values lower than a
>> certain threshold etc...

>> The problem is: sum() returns an NA (with a warning) if
>> the count is greater than 2^31. For example:

>>   > xx <- runif(3e9)
>>   > sum(xx < 0.9)
>>   [1] NA
>>   Warning message:
>>   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))

>> This already takes a long time and doing
>> sum(as.numeric(.)) would take even longer and require
>> allocation of 24Gb of memory just to store an
>> intermediate numeric vector made of 0s and 1s. Plus,
>> having to do sum(as.numeric(.)) every time I need to
>> count things is not convenient and is easy to forget.

>> It seems that sum() on a logical vector could be modified
>> to return the count as a double when it cannot be
>> represented as an integer.  Note that length() already
>> does this so that wouldn't create a precedent. Also and
>> FWIW prod() avoids the problem by always returning a
>> double, whatever the type of the input is (except on a
>> complex vector).

>> I can provide a patch if this change sounds reasonable.

> This sounds very reasonable, thank you Hervé, for the
> report, and even more for a (small) patch.

I was made aware of the fact that R treats logical and
integer very often identically in the C code, and in general we
even mention that logicals are treated as 0/1/NA integers in
arithmetic.

For the present case that would mean that we should also
safe-guard against *integer* overflow in sum(.)  and that is
not something we have done / wanted to do in the past...  Speed
being one reason.

So this ends up being more delicate than I had thought at first,
because changing  sum()  only would mean that

  sum(LOGI)   and
  sum(as.integer(LOGI))

would start to differ for a logical vector LOGI.

So, for now this is something that must be approached carefully,
and the R Core team may want to discuss "in private" first.

I'm sorry for having raised possibly unrealistic expectations.
Martin

> Martin

>> Cheers, H.

>> -- 
>> Hervé Pagès

>> Program in Computational Biology Division of Public
>> Health Sciences Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
>> 98109-1024

>> E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:
>> (206) 667-1319

>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Philosophy behind converting Fortran to C for use in R

2017-06-07 Thread Martyn Byng
Hi,

Just a quick comment on (1).

The C-Fortran interface has been standardized since Fortran 2003.  However, it 
requires the Fortran interface being called from C to have been written with 
C interoperability in mind, as specific C-interoperable types etc. must be 
used.

Trying to call a Fortran interface that hasn't been written using C 
interoperable types still suffers from the issues that Bill describes.
 
Martyn

-----Original Message-----
From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of William 
Dunlap via R-devel
Sent: 06 June 2017 22:34
To: Avraham Adler 
Cc: R-devel 
Subject: Re: [Rd] Philosophy behind converting Fortran to C for use in R

Here are three reasons for converting Fortran code, especially older
Fortran code, to C:

1. The C-Fortran interface is not standardized.  Various Fortran compilers
pass logical and character arguments in various ways.  Various Fortran
compilers mangle function and common block names in various ways.  You can
avoid that problem by restricting R to using a certain Fortran compiler,
but that can make porting R to a new platform difficult.

2. By default, variables in Fortran routines are not allocated on the
stack, but are statically allocated, making recursion hard.

3. New CS graduates tend not to know Fortran.

(There are good reasons for not translating as well, risk and time being
the main ones.)
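
To make point 2 concrete, here is a minimal sketch in C, not Fortran: the
`static` local below only mimics a Fortran local that lives in static
storage. The function names fact_auto() and fact_static() are hypothetical
and exist only for this illustration.

```c
#include <assert.h>

/* Recursive factorial with an ordinary automatic (stack) local:
   every call gets its own copy of k. */
static long fact_auto(int n)
{
    assert(n >= 0);
    int k = n;
    if (k <= 1)
        return 1;
    long sub = fact_auto(n - 1);
    return k * sub;
}

/* Same routine, but k is statically allocated, mimicking a Fortran
   local that lives in static storage: all calls share one k. */
static long fact_static(int n)
{
    assert(n >= 0);
    static int k;
    k = n;
    if (k <= 1)
        return 1;
    long sub = fact_static(n - 1);  /* inner calls overwrite k */
    return k * sub;                 /* for n >= 2, k is 1 by now, not n */
}
```

With these definitions fact_auto(5) evaluates to 120, while fact_static(5)
evaluates to 1, because the innermost call leaves the shared k at 1 before
any of the outer multiplications happen.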


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Jun 6, 2017 at 1:27 PM, Avraham Adler 
wrote:

> Hello.
>
> This is not a question about a bug or even best practices; rather I'm
> trying to understand the philosophy or theory as to why certain
> portions of the R codebase are written as they are. If this question
> is better posed elsewhere, please point me in the proper direction.
>
> In the thread about the issues with the Tukey line, Martin said [1]:
>
> > when this topic came up last (for me) in Dec. 2014, I did spend about
> > 2 days work (or more?) to get the FORTRAN code from the 1981 - book
> > (which is abbreviated the "ABC of EDA") from a somewhat useful OCR
> > scan into compilable Fortran code and then f2c'ed, wrote an R
> > interface function found problems…
>
> I have seen this in the R source code and elsewhere, that native
> Fortran is converted to C via f2c and then run as C within R. This is
> notwithstanding R's ability to use Fortran, either directly through
> .Fortran() [2] or via .Call() using simple helper C-wrappers [3].
>
> I'm curious as to the reason. Is it because much of the code was
> written before Fortran 90 compilers were freely available? Does it
> help with maintenance or make debugging easier? Is it faster or more
> likely to compile cleanly?
>
> Thank you,
>
> Avi
>
> [1] https://stat.ethz.ch/pipermail/r-devel/2017-May/074363.html
> [2] Such as kmeans does for the Hartigan-Wong method in the stats package
> [3] Such as the mvtnorm package does
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

