Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"

2020-06-30 Thread Johannes Rauh
Hello, everyone,

thank you for your quick and helpful responses and the detailed information.

Sorry for not providing a reproducible example for the (potential) bug in 
`tools::makeLazyLoadDB`.  The main point of my mail was the surprising 
behaviour of `basename` and `dirname`.  Fixing those functions would probably 
solve my problem for me (as a workaround, probably hiding some underlying 
problem, and likely leading to a failure for someone else fighting with 
encodings).

Concerning my underlying direct problem with `tools::makeLazyLoadDB`, I'm 
having difficulty making my example reproducible.  I'm trying to use a 
directory with a non-ASCII name for a knitr cache.  My R-4.0.0 here behaves 
differently from my R-3.6.3, but when I filed a bug report with knitr, Yihui 
could not reproduce this difference 
(https://github.com/yihui/knitr/issues/1840).  So I'll try R-4.0.2 next; let's 
see what happens.
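
For anyone who wants to check the `basename`/`dirname` observation on their own setup, a minimal sketch might look like this. `enc_report()` is a hypothetical helper written only for illustration. The path is built from byte escapes so that the result does not depend on the encoding of the script file, and the `tryCatch()` is an assumption that path translation may fail in some locales (reported as `NA`). The values printed depend on platform, locale and R version, so none are shown.

```r
# Hypothetical diagnostic helper (not part of R): report the declared
# encoding of a path and of the results of dirname() and basename().
enc_report <- function(p) {
  # Assumption: translation of the path may fail in some locales,
  # so an error is turned into NA instead of stopping the script.
  safe <- function(f) {
    tryCatch(Encoding(f(p)), error = function(e) NA_character_)
  }
  c(input = Encoding(p), dirname = safe(dirname), basename = safe(basename))
}

# "Föö/Bär" written as latin1 bytes and then marked as latin1.
p <- "F\xf6\xf6/B\xe4r"
Encoding(p) <- "latin1"

enc_report(p)
```

Running the same script under R-3.6.3, R-4.0.0 and R-4.0.2 would show whether the declared encodings differ between versions on a given machine.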

Cheers
Johannes

> Sent: Tuesday, 30 June 2020 at 09:25
> From: "Tomas Kalibera"
> To: "Johannes Rauh", "r-devel"
> Subject: Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"
>
> On 6/29/20 4:39 PM, Johannes Rauh wrote:
> > Dear R Developers,
> >
> > I noticed that `basename` and `dirname` always return "UTF-8" on Windows 
> > (tested with R-4.0.0 and R-3.6.3):
> >
> >> p <- "Föö/Bär"
> >> Encoding(p)
> > [1] "latin1"
> >> Encoding(dirname(p))
> > [1] "UTF-8"
> >> Encoding(basename(p))
> > [1] "UTF-8"
> >
> > Is this on purpose?  At least I did not find any relevant comment in the 
> > documentation of `dirname`/`basename`.
> > Background: I'm currently struggling with a directory name containing a 
> > latin1-character.  (I know that this is a bad idea, but I did not create 
> > the directory and I cannot rename it.)  I now want to pass a 
> > latin1-directory name to a function, which internally uses 
> > `tools::makeLazyLoadDB`.  At that point, internally, `dirname` is called, 
> > which changes the encoding, and things break.  If I use `debug` to halt the 
> > processing and "fix" the encoding, things work as expected.
> >
> > So, if possible, I would prefer that `dirname` and `basename` preserve the 
> > encoding.
> 
> Please try to always submit a minimal reproducible example with your 
> reports and test with at least the latest released version of R, ideally 
> also with R-devel.
> 
> As you have not sent a reproducible example, it is hard to tell for 
> sure, but most likely, as Kevin wrote, you have run into a real bug, which 
> was, however, already fixed in 4.0.2 and in R-devel (bug 17833). The lazy 
> loading cache did not work with file names in non-native encoding.
> 
> That real bug has been uncovered by legitimate and correct changes like 
> the ones you report, where file operations started returning non-ASCII 
> strings in UTF-8. Historically in R such functions would instead return 
> native strings with misrepresented characters, and we were reluctant to 
> change that, expecting it to wake up bugs in code silently assuming native 
> encoding. Still, as people were increasingly running into problems with 
> non-representable characters, we made that change in several functions 
> anyway, and yes, it started waking up bugs.
> 
> We could preferentially return results in native encoding, and in UTF-8 
> only when they included non-representable characters. That would increase 
> code complexity and performance overhead, but would be less likely to wake 
> up existing bugs.  Note: some code that previously relied on best-fit 
> conversions done by Windows will have been broken anyway. We 
> would have to bypass win_iconv/iconv for that (adding more complexity). 
> Bugs in code not handling encodings properly would still be triggered 
> via non-representable characters. I've recently changed file.path() in 
> R-devel to be slightly more conservative again, along these lines.
> 
> We can still do it more widely, but it is not high on the priority list. 
> The way to fix all of these problems is switching to UTF-8 as native 
> encoding on Windows, and every day spent on tuning the existing behavior 
> postpones that real solution.
> 
> Best
> Tomas
> 
> 
> >
> > Best regards
> > Johannes
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"

2020-06-30 Thread Tomas Kalibera

On 6/29/20 4:39 PM, Johannes Rauh wrote:

> Dear R Developers,
>
> I noticed that `basename` and `dirname` always return "UTF-8" on Windows
> (tested with R-4.0.0 and R-3.6.3):
>
> > p <- "Föö/Bär"
> > Encoding(p)
> [1] "latin1"
> > Encoding(dirname(p))
> [1] "UTF-8"
> > Encoding(basename(p))
> [1] "UTF-8"
>
> Is this on purpose?  At least I did not find any relevant comment in the
> documentation of `dirname`/`basename`.
>
> Background: I'm currently struggling with a directory name containing a
> latin1-character.  (I know that this is a bad idea, but I did not create
> the directory and I cannot rename it.)  I now want to pass a
> latin1-directory name to a function, which internally uses
> `tools::makeLazyLoadDB`.  At that point, internally, `dirname` is called,
> which changes the encoding, and things break.  If I use `debug` to halt the
> processing and "fix" the encoding, things work as expected.
>
> So, if possible, I would prefer that `dirname` and `basename` preserve the
> encoding.


Please try to always submit a minimal reproducible example with your 
reports and test with at least the latest released version of R, ideally 
also with R-devel.


As you have not sent a reproducible example, it is hard to tell for 
sure, but most likely, as Kevin wrote, you have run into a real bug, which 
was, however, already fixed in 4.0.2 and in R-devel (bug 17833). The lazy 
loading cache did not work with file names in non-native encoding.


That real bug has been uncovered by legitimate and correct changes like 
the ones you report, where file operations started returning non-ASCII 
strings in UTF-8. Historically in R such functions would instead return 
native strings with misrepresented characters, and we were reluctant to 
change that, expecting it to wake up bugs in code silently assuming native 
encoding. Still, as people were increasingly running into problems with 
non-representable characters, we made that change in several functions 
anyway, and yes, it started waking up bugs.


We could preferentially return results in native encoding, and in UTF-8 
only when they included non-representable characters. That would increase 
code complexity and performance overhead, but would be less likely to wake 
up existing bugs.  Note: some code that previously relied on best-fit 
conversions done by Windows will have been broken anyway. We 
would have to bypass win_iconv/iconv for that (adding more complexity). 
Bugs in code not handling encodings properly would still be triggered 
via non-representable characters. I've recently changed file.path() in 
R-devel to be slightly more conservative again, along these lines.
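
The policy described above (native encoding when the conversion is lossless, UTF-8 only for non-representable characters) could be sketched at the R level roughly as follows. `prefer_native()` is a hypothetical helper written for illustration; it is not part of R and is not how `file.path()` or any other base function is implemented. It assumes that a round trip through `iconv()` is an adequate test for representability, which also rejects best-fit substitutions.

```r
# Hypothetical helper (not part of R): return each string in the native
# encoding when that is lossless, and keep it in UTF-8 otherwise.
prefer_native <- function(x) {
  utf8 <- enc2utf8(x)
  # to = "" means the encoding of the current locale; elements that
  # cannot be converted become NA.
  native <- iconv(utf8, from = "UTF-8", to = "")
  # Round trip back to UTF-8 to detect substituted characters.
  back <- iconv(native, from = "", to = "UTF-8")
  ok <- !is.na(native) & !is.na(back) & back == utf8
  out <- utf8
  out[ok] <- native[ok]
  out
}

# The declared encodings of the result depend on the current locale.
x <- c("B\u00e4r", "\u4e2d\u6587")
Encoding(prefer_native(x))
```

The extra conversions on every call are the kind of overhead and complexity mentioned above, and code that assumes native encoding would still fail on the elements that stay in UTF-8.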


We can still do it more widely, but it is not high on the priority list. 
The way to fix all of these problems is switching to UTF-8 as native 
encoding on Windows, and every day spent on tuning the existing behavior 
postpones that real solution.


Best
Tomas




> Best regards
> Johannes



Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"

2020-06-29 Thread Kevin Ushey
Did you test with R 4.0.2 or R-devel? A bug related to this issue was
recently fixed:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17833

Best,
Kevin

On Mon, Jun 29, 2020 at 11:51 AM Duncan Murdoch
 wrote:
>
> On 29/06/2020 10:39 a.m., Johannes Rauh wrote:
> > Dear R Developers,
> >
> > I noticed that `basename` and `dirname` always return "UTF-8" on Windows 
> > (tested with R-4.0.0 and R-3.6.3):
> >
> >> p <- "Föö/Bär"
> >> Encoding(p)
> > [1] "latin1"
> >> Encoding(dirname(p))
> > [1] "UTF-8"
> >> Encoding(basename(p))
> > [1] "UTF-8"
> >
> > Is this on purpose?  At least I did not find any relevant comment in the 
> > documentation of `dirname`/`basename`.
> >
> > Background: I'm currently struggling with a directory name containing a 
> > latin1-character.  (I know that this is a bad idea, but I did not create 
> > the directory and I cannot rename it.)  I now want to pass a 
> > latin1-directory name to a function, which internally uses 
> > `tools::makeLazyLoadDB`.  At that point, internally, `dirname` is called, 
> > which changes the encoding, and things break.  If I use `debug` to halt the 
> > processing and "fix" the encoding, things work as expected.
> >
> > So, if possible, I would prefer that `dirname` and `basename` preserve the 
> > encoding.
>
> Actually, makeLazyLoadDB isn't exported from tools, so strictly speaking
> you shouldn't be calling it.  Or perhaps you have a good reason to call
> it, and should be asking for it to be exported, or you are calling a
> published function which calls it:  in either case it should probably be
> fixed to accept UTF-8.
>
> But it doesn't call dirname or basename, so maybe the function that
> calls it is the one that needs fixing.
>
> In any case, while asking dirname() and basename() to preserve the
> encoding sounds reasonable, it seems like it would just be covering up a
> deeper problem.
>
> Duncan Murdoch


Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"

2020-06-29 Thread Duncan Murdoch

On 29/06/2020 10:39 a.m., Johannes Rauh wrote:

> Dear R Developers,
>
> I noticed that `basename` and `dirname` always return "UTF-8" on Windows
> (tested with R-4.0.0 and R-3.6.3):
>
> > p <- "Föö/Bär"
> > Encoding(p)
> [1] "latin1"
> > Encoding(dirname(p))
> [1] "UTF-8"
> > Encoding(basename(p))
> [1] "UTF-8"
>
> Is this on purpose?  At least I did not find any relevant comment in the
> documentation of `dirname`/`basename`.
>
> Background: I'm currently struggling with a directory name containing a
> latin1-character.  (I know that this is a bad idea, but I did not create
> the directory and I cannot rename it.)  I now want to pass a
> latin1-directory name to a function, which internally uses
> `tools::makeLazyLoadDB`.  At that point, internally, `dirname` is called,
> which changes the encoding, and things break.  If I use `debug` to halt the
> processing and "fix" the encoding, things work as expected.
>
> So, if possible, I would prefer that `dirname` and `basename` preserve the
> encoding.


Actually, makeLazyLoadDB isn't exported from tools, so strictly speaking 
you shouldn't be calling it.  Or perhaps you have a good reason to call 
it, and should be asking for it to be exported, or you are calling a 
published function which calls it:  in either case it should probably be 
fixed to accept UTF-8.


But it doesn't call dirname or basename, so maybe the function that 
calls it is the one that needs fixing.


In any case, while asking dirname() and basename() to preserve the 
encoding sounds reasonable, it seems like it would just be covering up a 
deeper problem.


Duncan Murdoch
