Re: [Rd] R-devel internal errors during check produce?

2020-06-29 Thread Kurt Hornik
> Jan Gorecki writes:

> Thank you both,
> You are absolutely correct that the example should be minimal, so here it is.

> l = list(a=new.env(), b=new.env())
> unique(l)

> Just for completeness, here is the env_list that raises the error during the check:

> env_list <- list(baseenv(),
>   as.environment("package:graphics"),
>   as.environment("package:stats"),
>   as.environment("package:utils"),
>   as.environment("package:methods")
> )
> unique(env_list)

Thanks ... but the above works fine for me.  E.g.,

R> l = list(a=new.env(), b=new.env())
R> unique(l)
[[1]]


[[2]]


Best
-k


> Best regards,
> Jan

> On Mon, Jun 29, 2020 at 5:42 PM Martin Maechler
>  wrote:
>> 
>> > Kurt Hornik
>> > on Mon, 29 Jun 2020 16:13:03 +0200 writes:
>> 
>> > Jan Gorecki writes:
>> >> So the unique.default is from the R tools package during
>> >> checks.  I don't see those issues on CRAN checks.
>> 
>> > I cannot reproduce this locally (and have no clues about
>> > docker).  Perhaps you can try to debug this on your end?
>> > And see what env_list is when the error occurs?
>> 
>> > Best -k
>> 
>> Indeed, if it is a bug in R (as opposed to being an assumption
>> that 'data.table' makes about undocumented R internals), it
>> should be reproducible with a very small dummy package instead
>> of data.table. ... or actually reproducible with relatively
>> simple R code calling unique() not involving any non-base package.
>> 
>> Martin
>> 
>> 
>> >> Exact environment where I am reproducing this issue is a
>> >> fresh ubuntu, no R packages pre-installed docker pull
>> >> registry.gitlab.com/jangorecki/dockerfiles/r-devel
>> >> https://gitlab.com/jangorecki/dockerfiles/-/raw/master/r-devel/Dockerfile
>> 
>> >> On Sat, Jun 27, 2020 at 12:37 AM Jan Gorecki
>> >>  wrote:
>> >>>
>> >>> Hi R developers,
>> >>>
>> >>> On R-devel (2020-06-24 r78746) I am getting those two
>> >>> new exceptions during R check. I found a change which
>> >>> may possibly be related
>> >>> https://github.com/wch/r-source/commit/69de92b9fb1b7f2a7c8d1394b8d56050881a5465
>> >>> I think this may be a regression. I grep'ed package
>> >>> manuals and R code for unique.default but don't see
>> >>> any. Usage section of the unique method looks fine as
>> >>> well. Errors look a little bit like internal errors.
>> >>>
>> >>> * checking Rd \usage sections ... NOTE Error in
>> >>> unique.default(env_list) : LENGTH or similar applied to
>> >>> environment object Calls: 
>> >>> ... .get_S3_generics_as_seen_from_package -> unique ->
>> >>> unique.default Execution halted The \usage entries for
>> >>> S3 methods should use the \method markup and not their
>> >>> full name.  * checking S3 generic/method consistency
>> >>> ... WARNING Error in unique.default(env_list) : LENGTH
>> >>> or similar applied to environment object Calls:
>> >>>  ... .get_S3_generics_as_seen_from_package ->
>> >>> unique -> unique.default
>> >>>
>> >>> I don't know whether it is related, but I build R-devel with
>> >>> extra args: --with-recommended-packages
>> >>> --enable-strict-barrier --disable-long-double I check
>> >>> with: --as-cran --no-manual To reproduce download
>> >>> current data.table from CRAN (1.12.8) and run R check
>> >>>
>> >>> Best regards, Jan Gorecki
>> 
>> >> __
>> >> R-devel@r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> > __
>> > R-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [External] Possible ABI change in R 4.0.1

2020-06-29 Thread luke-tierney

EXTPTR_PTR is not in the API so it is not guaranteed to even exist in
the future. The API function for accessing the pointer address is
R_ExternalPtrAddr. See Section 5.13 in WRE.

Sometimes internals need to be changed. In this case a change was made
to deal with a segfault; the commit notice tells you the PR this
addressed.

As it says in Writing R Extensions about defining USE_RINTERNALS:

Also be prepared to adjust your code should R internals change.

The same goes for any use of non-API macros and functions.

Best,

luke


On Mon, 29 Jun 2020, Gábor Csárdi wrote:

> Hi all,
>
> it seems that from R 4.0.1 EXTPTR_PTR can be either a macro or a
> function, depending on whether USE_RINTERNALS is requested.
>
> Jeroen helped me find that this was in 78592:
> https://github.com/wch/r-source/commit/c634fec5214e73747b44d7c0e6f047fefe44667d
>
> This is a problem, because binary packages that are built on R 4.0.1
> or R 4.0.2 will potentially not load on R 4.0.0, if they use the
> EXTPTR_PTR function.
>
> E.g. this is R 4.0.0 on Linux:
>
> > library(Rcpp)
> Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file,
> DLLpath = DLLpath, ...):
>  unable to load shared object '/usr/local/lib/R/library/Rcpp/libs/Rcpp.so':
>   Error relocating /usr/local/lib/R/library/Rcpp/libs/Rcpp.so:
> EXTPTR_PTR: symbol not found
> In addition: Warning message:
> package ‘Rcpp’ was built under R version 4.0.1
>
> It is easiest to reproduce this on Windows, because the CRAN binaries
> are now built on R 4.0.2, so if you install Rcpp on R 4.0.0 from CRAN,
> and try to load it you'll get:
>
> > library(Rcpp)
> Error: package or namespace load failed for 'Rcpp' in inDL(x,
> as.logical(local), as.logical(now), ...):
>  unable to load shared object
> 'C:/Users/csard/R/win-library/4.0/Rcpp/libs/x64/Rcpp.dll':
>   LoadLibrary failure:  The specified procedure could not be found.
> In addition: Warning message:
> package 'Rcpp' was built under R version 4.0.2
>
> I suppose this change was not intended?
>
> Best,
> Gabor




--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu


Re: [Rd] R-devel internal errors during check produce?

2020-06-29 Thread Jan Gorecki
Thank you both,
You are absolutely correct that the example should be minimal, so here it is.

l = list(a=new.env(), b=new.env())
unique(l)

Just for completeness, here is the env_list that raises the error during the check:

env_list <- list(baseenv(),
  as.environment("package:graphics"),
  as.environment("package:stats"),
  as.environment("package:utils"),
  as.environment("package:methods")
)
unique(env_list)
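
Because the error shows up only on some builds, it may help to run the reproduction in a way that records the outcome instead of stopping. The following is a minimal sketch using base R only; the helper name `try_unique` is made up for illustration and is not part of R or of the check code.

```r
# Hypothetical helper (not part of R): run unique() and report what happened
# instead of aborting, so that results can be compared across builds.
try_unique <- function(x) {
  tryCatch(
    list(ok = TRUE, n = length(unique(x)), msg = NA_character_),
    error = function(e) {
      list(ok = FALSE, n = NA_integer_, msg = conditionMessage(e))
    }
  )
}

l <- list(a = new.env(), b = new.env())
res <- try_unique(l)

# Print the R version next to the outcome so reports are self-describing.
cat(R.version.string, "\n")
str(res)
```

On a build that shows the problem, `res$msg` would be expected to carry the error text; on other builds `res$ok` should be `TRUE`.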

Best regards,
Jan

On Mon, Jun 29, 2020 at 5:42 PM Martin Maechler
 wrote:
>
> > Kurt Hornik
> > on Mon, 29 Jun 2020 16:13:03 +0200 writes:
>
> > Jan Gorecki writes:
> >> So the unique.default is from the R tools package during
> >> checks.  I don't see those issues on CRAN checks.
>
> > I cannot reproduce this locally (and have no clues about
> > docker).  Perhaps you can try to debug this on your end?
> > And see what env_list is when the error occurs?
>
> > Best -k
>
> Indeed, if it is a bug in R (as opposed to being an assumption
> that 'data.table' makes about undocumented R internals), it
> should be reproducible with a very small dummy package instead
> of data.table. ... or actually reproducible with relatively
> simple R code calling unique() not involving any non-base package.
>
> Martin
>
>
> >> Exact environment where I am reproducing this issue is a
> >> fresh ubuntu, no R packages pre-installed docker pull
> >> registry.gitlab.com/jangorecki/dockerfiles/r-devel
> >> 
> https://gitlab.com/jangorecki/dockerfiles/-/raw/master/r-devel/Dockerfile
>
> >> On Sat, Jun 27, 2020 at 12:37 AM Jan Gorecki
> >>  wrote:
> >>>
> >>> Hi R developers,
> >>>
> >>> On R-devel (2020-06-24 r78746) I am getting those two
> >>> new exceptions during R check. I found a change which
> >>> may possibly be related
> >>> 
> https://github.com/wch/r-source/commit/69de92b9fb1b7f2a7c8d1394b8d56050881a5465
> >>> I think this may be a regression. I grep'ed package
> >>> manuals and R code for unique.default but don't see
> >>> any. Usage section of the unique method looks fine as
> >>> well. Errors look a little bit like internal errors.
> >>>
> >>> * checking Rd \usage sections ... NOTE Error in
> >>> unique.default(env_list) : LENGTH or similar applied to
> >>> environment object Calls: 
> >>> ... .get_S3_generics_as_seen_from_package -> unique ->
> >>> unique.default Execution halted The \usage entries for
> >>> S3 methods should use the \method markup and not their
> >>> full name.  * checking S3 generic/method consistency
> >>> ... WARNING Error in unique.default(env_list) : LENGTH
> >>> or similar applied to environment object Calls:
> >>>  ... .get_S3_generics_as_seen_from_package ->
> >>> unique -> unique.default
> >>>
> >>> I don't know whether it is related, but I build R-devel with
> >>> extra args: --with-recommended-packages
> >>> --enable-strict-barrier --disable-long-double I check
> >>> with: --as-cran --no-manual To reproduce download
> >>> current data.table from CRAN (1.12.8) and run R check
> >>>
> >>> Best regards, Jan Gorecki
>


Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"

2020-06-29 Thread Kevin Ushey
Did you test with R 4.0.2 or R-devel? A bug related to this issue was
recently fixed:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17833
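
To see whether a particular build is affected, one could compare the encoding marks before and after the calls. This is a minimal sketch using base R only; the variable names are illustrative, and the result depends on platform, locale, and R version, so no output is shown.

```r
# Build a latin1-marked path without relying on the source file's encoding.
p <- iconv("F\u00f6\u00f6/B\u00e4r", from = "UTF-8", to = "latin1")

# Collect the declared encodings of the input and of both results.
enc <- c(
  input    = Encoding(p),
  dirname  = Encoding(dirname(p)),
  basename = Encoding(basename(p))
)

# Print the R version next to the marks so reports are self-describing.
cat(R.version.string, "\n")
print(enc)
```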

Best,
Kevin

On Mon, Jun 29, 2020 at 11:51 AM Duncan Murdoch
 wrote:
>
> On 29/06/2020 10:39 a.m., Johannes Rauh wrote:
> > Dear R Developers,
> >
> > I noticed that `basename` and `dirname` always return "UTF-8" on Windows 
> > (tested with R-4.0.0 and R-3.6.3):
> >
> >> p <- "Föö/Bär"
> >> Encoding(p)
> > [1] "latin1"
> >> Encoding(dirname(p))
> > [1] "UTF-8"
> >> Encoding(basename(p))
> > [1] "UTF-8"
> >
> > Is this on purpose?  At least I did not find any relevant comment in the 
> > documentation of `dirname`/`basename`.
> >
> > Background: I'm currently struggling with a directory name containing a 
> > latin1-character.  (I know that this is a bad idea, but I did not create 
> > the directory and I cannot rename it.)  I now want to pass a 
> > latin1-directory name to a function, which internally uses 
> > `tools::makeLazyLoadDB`.  At that point, internally, `dirname` is called, 
> > which changes the encoding, and things break.  If I use `debug` to halt the 
> > processing and "fix" the encoding, things work as expected.
> >
> > So, if possible, I would prefer that `dirname` and `basename` preserve the 
> > encoding.
>
> Actually, makeLazyLoadDB isn't exported from tools, so strictly speaking
> you shouldn't be calling it.  Or perhaps you have a good reason to call
> it, and should be asking for it to be exported, or you are calling a
> published function which calls it:  in either case it should probably be
> fixed to accept UTF-8.
>
> But it doesn't call dirname or basename, so maybe the function that
> calls it is the one that needs fixing.
>
> In any case, while asking dirname() and basename() to preserve the
> encoding sounds reasonable, it seems like it would just be covering up a
> deeper problem.
>
> Duncan Murdoch
>


[Rd] Possible ABI change in R 4.0.1

2020-06-29 Thread Gábor Csárdi
Hi all,

it seems that from R 4.0.1 EXTPTR_PTR can be either a macro or a
function, depending on whether USE_RINTERNALS is requested.

Jeroen helped me find that this was in 78592:
https://github.com/wch/r-source/commit/c634fec5214e73747b44d7c0e6f047fefe44667d

This is a problem, because binary packages that are built on R 4.0.1
or R 4.0.2 will potentially not load on R 4.0.0, if they use the
EXTPTR_PTR function.

E.g. this is R 4.0.0 on Linux:

> library(Rcpp)
Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file,
DLLpath = DLLpath, ...):
 unable to load shared object '/usr/local/lib/R/library/Rcpp/libs/Rcpp.so':
  Error relocating /usr/local/lib/R/library/Rcpp/libs/Rcpp.so:
EXTPTR_PTR: symbol not found
In addition: Warning message:
package ‘Rcpp’ was built under R version 4.0.1

It is easiest to reproduce this on Windows, because the CRAN binaries
are now built on R 4.0.2, so if you install Rcpp on R 4.0.0 from CRAN,
and try to load it you'll get:

> library(Rcpp)
Error: package or namespace load failed for 'Rcpp' in inDL(x,
as.logical(local), as.logical(now), ...):
 unable to load shared object
'C:/Users/csard/R/win-library/4.0/Rcpp/libs/x64/Rcpp.dll':
  LoadLibrary failure:  The specified procedure could not be found.
In addition: Warning message:
package 'Rcpp' was built under R version 4.0.2

I suppose this change was not intended?

Best,
Gabor



Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"

2020-06-29 Thread Duncan Murdoch

On 29/06/2020 10:39 a.m., Johannes Rauh wrote:
> Dear R Developers,
>
> I noticed that `basename` and `dirname` always return "UTF-8" on Windows
> (tested with R-4.0.0 and R-3.6.3):
>
> > p <- "Föö/Bär"
> > Encoding(p)
> [1] "latin1"
> > Encoding(dirname(p))
> [1] "UTF-8"
> > Encoding(basename(p))
> [1] "UTF-8"
>
> Is this on purpose?  At least I did not find any relevant comment in the
> documentation of `dirname`/`basename`.
>
> Background: I'm currently struggling with a directory name containing a
> latin1-character.  (I know that this is a bad idea, but I did not create the
> directory and I cannot rename it.)  I now want to pass a latin1-directory name
> to a function, which internally uses `tools::makeLazyLoadDB`.  At that point,
> internally, `dirname` is called, which changes the encoding, and things break.
> If I use `debug` to halt the processing and "fix" the encoding, things work as
> expected.
>
> So, if possible, I would prefer that `dirname` and `basename` preserve the
> encoding.


Actually, makeLazyLoadDB isn't exported from tools, so strictly speaking 
you shouldn't be calling it.  Or perhaps you have a good reason to call 
it, and should be asking for it to be exported, or you are calling a 
published function which calls it:  in either case it should probably be 
fixed to accept UTF-8.


But it doesn't call dirname or basename, so maybe the function that 
calls it is the one that needs fixing.


In any case, while asking dirname() and basename() to preserve the 
encoding sounds reasonable, it seems like it would just be covering up a 
deeper problem.


Duncan Murdoch



Re: [Rd] R-devel internal errors during check produce?

2020-06-29 Thread Martin Maechler
> Kurt Hornik 
> on Mon, 29 Jun 2020 16:13:03 +0200 writes:

> Jan Gorecki writes:
>> So the unique.default is from the R tools package during
>> checks.  I don't see those issues on CRAN checks.

> I cannot reproduce this locally (and have no clues about
> docker).  Perhaps you can try to debug this on your end?
> And see what env_list is when the error occurs?

> Best -k

Indeed, if it is a bug in R (as opposed to being an assumption
that 'data.table' makes about undocumented R internals), it
should be reproducible with a very small dummy package instead
of data.table. ... or actually reproducible with relatively
simple R code calling unique() not involving any non-base package.

Martin


>> Exact environment where I am reproducing this issue is a
>> fresh ubuntu, no R packages pre-installed docker pull
>> registry.gitlab.com/jangorecki/dockerfiles/r-devel
>> https://gitlab.com/jangorecki/dockerfiles/-/raw/master/r-devel/Dockerfile

>> On Sat, Jun 27, 2020 at 12:37 AM Jan Gorecki
>>  wrote:
>>> 
>>> Hi R developers,
>>> 
>>> On R-devel (2020-06-24 r78746) I am getting those two
>>> new exceptions during R check. I found a change which
>>> may possibly be related
>>> 
https://github.com/wch/r-source/commit/69de92b9fb1b7f2a7c8d1394b8d56050881a5465
>>> I think this may be a regression. I grep'ed package
>>> manuals and R code for unique.default but don't see
>>> any. Usage section of the unique method looks fine as
>>> well. Errors look a little bit like internal errors.
>>> 
>>> * checking Rd \usage sections ... NOTE Error in
>>> unique.default(env_list) : LENGTH or similar applied to
>>> environment object Calls: 
>>> ... .get_S3_generics_as_seen_from_package -> unique ->
>>> unique.default Execution halted The \usage entries for
>>> S3 methods should use the \method markup and not their
>>> full name.  * checking S3 generic/method consistency
>>> ... WARNING Error in unique.default(env_list) : LENGTH
>>> or similar applied to environment object Calls:
>>>  ... .get_S3_generics_as_seen_from_package ->
>>> unique -> unique.default
>>> 
>>> I don't know whether it is related, but I build R-devel with
>>> extra args: --with-recommended-packages
>>> --enable-strict-barrier --disable-long-double I check
>>> with: --as-cran --no-manual To reproduce download
>>> current data.table from CRAN (1.12.8) and run R check
>>> 
>>> Best regards, Jan Gorecki



[Rd] `basename` and `dirname` change the encoding to "UTF-8"

2020-06-29 Thread Johannes Rauh
Dear R Developers,

I noticed that `basename` and `dirname` always return "UTF-8" on Windows 
(tested with R-4.0.0 and R-3.6.3):

> p <- "Föö/Bär"
> Encoding(p)
[1] "latin1"
> Encoding(dirname(p))
[1] "UTF-8"
> Encoding(basename(p))
[1] "UTF-8"

Is this on purpose?  At least I did not find any relevant comment in the 
documentation of `dirname`/`basename`.

Background: I'm currently struggling with a directory name containing a 
latin1-character.  (I know that this is a bad idea, but I did not create the 
directory and I cannot rename it.)  I now want to pass a latin1-directory name 
to a function, which internally uses `tools::makeLazyLoadDB`.  At that point, 
internally, `dirname` is called, which changes the encoding, and things break.  
If I use `debug` to halt the processing and "fix" the encoding, things work as 
expected.

So, if possible, I would prefer that `dirname` and `basename` preserve the 
encoding.

Best regards
Johannes



Re: [Rd] "R CMD Sweave --driver=..." woes

2020-06-29 Thread Kurt Hornik
> Vincent Goulet via R-devel writes:

Thanks: fixed now in the trunk with c78751.

Best
-k

> In trying to change the driver used by Sweave on the command line using 
>     R CMD Sweave --driver=foo

> I consistently get the "directory 'foo' does not exist" error. (For any value 
> of 'foo', even the default 'RweaveLatex'.)

> Looking up the source code for function .Sweave that is called by 'R CMD 
> Sweave', I notice that the argument 'driver', if used, is added to the vector 
> of arguments of 'buildVignette' without being named. It ends up being passed 
> to argument 'dir', hence the error.

> I believe the simple patch below should fix the issue, but I wasn't able to 
> test it.

> Hope this helps.

> v.

> Vincent Goulet
> Professeur titulaire
> École d'actuariat, Université Laval


> Index: src/library/utils/R/Sweave.R
> ===
> --- src/library/utils/R/Sweave.R  (revision 78746)
> +++ src/library/utils/R/Sweave.R  (working copy)
> @@ -516,7 +516,7 @@
> do_exit(1L)
> }
> args <- list(file=file, tangle=FALSE, latex=toPDF, engine=engine, 
> clean=clean)
> -if(nzchar(driver)) args <- c(args, driver)
> +if(nzchar(driver)) args <- c(args, driver=driver)
> args <- c(args, encoding = encoding)
> if(nzchar(options)) {
> opts <- eval(str2expression(paste0("list(", options, ")")))
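
The effect described in the report can be illustrated in isolation. The sketch below assumes that the argument list is eventually handed to the target function with something like `do.call()`; `build_stub` is a hypothetical stand-in whose second formal is `dir`, not the real `buildVignette`.

```r
# Hypothetical stand-in for a function whose second formal is 'dir'.
# It only echoes how its arguments were matched.
build_stub <- function(file, dir = ".", driver = NULL) {
  list(dir = dir, driver = driver)
}

args <- list(file = "doc.Rnw")
driver <- "foo"

# Unnamed element: matched by position, so it fills the next free formal.
unnamed <- do.call(build_stub, c(args, driver))

# Named element: matched by name, as intended by the patch.
named <- do.call(build_stub, c(args, driver = driver))

str(unnamed)  # 'foo' ends up in 'dir'
str(named)    # 'foo' ends up in 'driver'
```

Because `file` is matched by name, the unnamed value goes to the first formal that is still unmatched, which here is `dir`.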



Re: [Rd] R-devel internal errors during check produce?

2020-06-29 Thread Kurt Hornik
> Jan Gorecki writes:

> So the unique.default is from the R tools package during checks.
> I don't see those issues on CRAN checks.

I cannot reproduce this locally (and have no clues about docker).
Perhaps you can try to debug this on your end?  And see what env_list is
when the error occurs?

Best
-k


> Exact environment where I am reproducing this issue is a fresh ubuntu,
> no R packages pre-installed
> docker pull registry.gitlab.com/jangorecki/dockerfiles/r-devel
> https://gitlab.com/jangorecki/dockerfiles/-/raw/master/r-devel/Dockerfile

> On Sat, Jun 27, 2020 at 12:37 AM Jan Gorecki  wrote:
>> 
>> Hi R developers,
>> 
>> On R-devel (2020-06-24 r78746) I am getting those two new exceptions
>> during R check. I found a change which may possibly be related
>> https://github.com/wch/r-source/commit/69de92b9fb1b7f2a7c8d1394b8d56050881a5465
>> I think this may be a regression. I grep'ed package manuals and R code
>> for unique.default but don't see any. Usage section of the unique
>> method looks fine as well. Errors look a little bit like internal
>> errors.
>> 
>> * checking Rd \usage sections ... NOTE
>> Error in unique.default(env_list) :
>> LENGTH or similar applied to environment object
>> Calls:  ... .get_S3_generics_as_seen_from_package ->
>> unique -> unique.default
>> Execution halted
>> The \usage entries for S3 methods should use the \method markup and not
>> their full name.
>> * checking S3 generic/method consistency ... WARNING
>> Error in unique.default(env_list) :
>> LENGTH or similar applied to environment object
>> Calls:  ... .get_S3_generics_as_seen_from_package ->
>> unique -> unique.default
>> 
>> I don't know whether it is related, but I build R-devel with extra args:
>> --with-recommended-packages --enable-strict-barrier --disable-long-double
>> I check with:
>> --as-cran --no-manual
>> To reproduce download current data.table from CRAN (1.12.8) and run R check
>> 
>> Best regards,
>> Jan Gorecki



[Rd] A warning in gzcon but not in gzfile

2020-06-29 Thread Jeff King
Hi all,

I used `gzfile` and `gzcon` to read a compressed file, but `gzcon` gave me
a different result than `gzfile`. It seems that `gzcon` does not handle the
data correctly. I have posted an example below. In the example, a portion
of a compressed file is downloaded from Google Cloud as a raw vector, and
the data is saved to a temp file. If I use `gzfile` to read the file, it
returns the first 1000 lines successfully. However, if I wrap the raw
vector in a connection and use `gzcon` to read from that connection, it
returns only the first 884 lines along with a warning (see the output).

code:

> # install.packages("BiocManager")
> # BiocManager::install("GCSConnection", version = "devel")
> library(GCSConnection)
> ## Download data from cloud
> uri <- "gs://gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz"
> con <- gcs_connection(uri)
> data <- readBin(con, raw(), 4*1024*1024)
> close(con)
>
> ## Write data to a file
> file_path <- tempfile()
> writeBin(data, file_path)
>
> ## Read the data using `gzfile`
> con1 <- gzfile(file_path)
> str(readLines(con1, 1000))
>
> ## Read the data using `gzcon`
> ## We create a raw connection from the raw vector
> con2 <- gzcon(rawConnection(data))
> str(readLines(con2, 1000))

output:

> > str(readLines(con1, 1000))
>  chr [1:1000] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
> > str(readLines(con2, 1000))
>  chr [1:884] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
> Warning message:
> In readLines(con2, 1000) : incomplete final line found on 'gzcon(data)'


I am not sure whether this is caused by a bug in `gzcon` or by my misuse of
the function. The same result can be observed with R 4.0 and R 4.1 devel on
Windows. Here is my session info; I hope it is helpful. Any suggestions and
help would be appreciated.

> R Under development (unstable) (2020-06-27 r78747)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 18363)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> system code page: 65001
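
One way to narrow this down without network access or extra packages is to compare the two readers on locally generated data. The sketch below rests on an assumption that is not confirmed anywhere in this report: that the difference is related to the file consisting of several gzip members written back to back, as `.bgz` (BGZF) files do. The helper `write_gz` is made up for illustration, and results may differ between R versions, so no output is shown.

```r
# Hypothetical helper: gzip some lines into a temp file and return the bytes.
write_gz <- function(lines) {
  f <- tempfile(fileext = ".gz")
  con <- gzfile(f, "wb")
  writeLines(lines, con)
  close(con)
  readBin(f, raw(), file.size(f))
}

# Concatenate two gzip members so the data holds them back to back.
data <- c(write_gz(sprintf("line %d", 1:500)),
          write_gz(sprintf("line %d", 501:1000)))

file_path <- tempfile()
writeBin(data, file_path)

# Read via gzfile() from disk.
con1 <- gzfile(file_path)
n_gzfile <- length(readLines(con1, 1000))
close(con1)

# Read via gzcon() wrapped around a raw connection.
con2 <- gzcon(rawConnection(data))
n_gzcon <- length(readLines(con2, 1000))
close(con2)

# Number of lines each reader returned.
print(c(gzfile = n_gzfile, gzcon = n_gzcon))
```

If the two counts differ here as well, the problem can be reported without reference to the cloud download; if they agree, the multi-member assumption is probably not the explanation.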



Best,
Jiefei



Re: [Rd] Error in substring: invalid multibyte string

2020-06-29 Thread Tomas Kalibera
From the user's (or package author's) point of view, all strings should
always be valid in their declared encoding. If they are not, the result of
string operations is undefined: it may be an error or a warning, but also
a silently produced correct or incorrect result. There are R functions
that check whether a string is valid. In this example, the string was
invalid in its declared encoding.
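
As an illustration of such checks, here is a minimal sketch using base R's `validUTF8()` and `validEnc()` on the string from this thread. The variable names are illustrative, and treating the bytes as Latin-1 is an assumption based on the earlier discussion of the archive page.

```r
# The byte 0xE4 is "a with diaeresis" in Latin-1 but is not valid on its own
# in UTF-8.
x <- "Jens Oehlschl\xe4gel-Akiyoshi"

# Check the raw bytes against UTF-8.
print(validUTF8(x))

# Option 1: declare the encoding the bytes are assumed to be in, then check
# validity in that declared encoding.
y <- x
Encoding(y) <- "latin1"
print(validEnc(y))

# Option 2: convert explicitly from Latin-1 to UTF-8.
z <- iconv(x, from = "latin1", to = "UTF-8")
print(validUTF8(z))
print(nchar(z))
```

After the conversion the string is valid UTF-8, so character-based operations such as `nchar()` and `substr()` have a well-defined result.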


From the viewpoint of R implementation (or of external software), some 
operations such as substring can be carried out in a well defined way 
even on strings with invalid characters or characters invalid in 
specific ways, usually only in some encodings (e.g. UTF-8), and the 
implementation is then more complicated. Some operations can't be well 
defined on such strings.


It may seem it would make sense to ban all invalid strings (not allow 
their creation) as not to mask errors like the one you have encountered, 
but it is sometimes better for debugging to be able to include invalid 
strings in error and diagnostic messages. Moreover, some systems support 
invalid strings in some operations also as they may appear in file 
names. On Windows, file names may include unpaired UTF-16 surrogates, 
which can't be represented in UTF-8. Some systems allow representing 
invalid strings in a custom way that is a valid string but preserves the 
information, only in some encodings (e.g. in UTF-8).


So differences in how invalid strings are treated by different R
functions are to be expected. The same applies to differences with
respect to external software. Some software may be optimized for UTF-8
and support invalid strings in more cases (R does not support substring
on invalid strings); other software may of course have bugs, or may
intentionally skip validity checks when they are perceived as too slow
for a given operation.

Best
Tomas


On 6/28/20 12:38 AM, Toby Hocking wrote:
> Thanks for the quick response Ivan. readLines with encoding='latin1' works
> for me (on Ubuntu).
>
> However I was more concerned with the inconsistency in results between
> substr and regexpr. I was expecting that if one of them errors because of
> an unknown encoding then the other should as well. Even better, if regexpr
> works, why shouldn't substr work as well?
>
> Incidentally the analogous stringi function stri_sub works fine in this
> case:
>
> > stringi::stri_sub("Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
> [1] "Jens Oehlschl\xe4gel-Akiyoshi"
>
> But the stringi analog to nchar gives a similar warning:
>
> > stringi::stri_length("Jens Oehlschl\xe4gel-Akiyoshi")
> [1] NA
> Warning message:
> In stringi::stri_length("Jens Oehlschl\xe4gel-Akiyoshi") :
>    invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
>
> On Sat, Jun 27, 2020 at 2:12 AM Ivan Krylov  wrote:
>> On Fri, 26 Jun 2020 15:57:06 -0700
>> Toby Hocking  wrote:
>>
>>> invalid multibyte string at 'gel-A<6b>iyoshi'
>>> https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html
>>
>> The server says that the text is UTF-8:
>>
>> curl -sI \
>>   https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \
>>   grep Content-Type
>> # Content-Type: text/html; charset=UTF-8
>>
>> But it's not, at least not all of it. If you ask readLines to mark
>> the text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the
>> mojibake and invalid multi-byte characters:
>>
>> x <- readLines(
>>   'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html',
>>   encoding = 'latin1'
>> )[28]
>> substr(x, 1, 100)
>> # [1] "Jens Oehlschlägel-Akiyoshi"
>>
>> The behaviour we observe when encoding = 'latin1' is not specified
>> results from returned lines having "unknown" encoding. The substr()
>> implementation tries to interpret such strings according to multi-byte C
>> locale rules (using mbrtowc(3)). On my system (yours too, probably, if
>> it's GNU/Linux or macOS), the multi-byte C locale encoding is UTF-8,
>> and this Latin-1 string does not result in valid code points when
>> decoded as UTF-8.
>>
>> --
>> Best regards,
>> Ivan
