Re: [Rd] readLines interaction with gsub different in R-dev

2018-02-19 Thread Tomas Kalibera

Thank you for the report and analysis. Now fixed in R-devel.
Tomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




Re: [Rd] readLines interaction with gsub different in R-dev

2018-02-17 Thread William Dunlap via R-devel
I think the problem in R-devel happens when there are non-ASCII characters
in any of the strings passed to gsub.

txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))), rawToChar, "")
txt
#[1] "Amélie" "Amelia"
Encoding(txt)
#[1] "unknown" "unknown"
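gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt)
#[1] (output lost in the archive: text between "<" and ">" was stripped)
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[1])
#[1] (output lost in the archive)
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2])
#[1] (output lost in the archive)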

I can change the Encoding to "latin1" or "UTF-8" and get similar results
from gsub.


Bill Dunlap
TIBCO Software
wdunlap tibco.com
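
For readers who want to re-run the example above, here is a minimal sketch of the same setup. It assumes an R release in which the problem is fixed. The variable name `ascii_only` is introduced here for illustration and is not part of the original message. Only the pure-ASCII element has a result that can be stated with certainty, so that is the only output annotated.

```r
# Build the two strings from raw bytes, as in the message above.
txt <- vapply(
  list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
       as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))),
  rawToChar, "")

validUTF8(txt)             # TRUE TRUE: both byte sequences are valid UTF-8
Encoding(txt)              # "unknown" "unknown"

# Marking the vector as UTF-8 only affects the non-ASCII element;
# pure-ASCII strings are never marked.
Encoding(txt) <- "UTF-8"
Encoding(txt)              # "UTF-8" "unknown"

# The pure-ASCII element on its own: lower-case the first letter of each
# pair and upper-case the second.
ascii_only <- gsub("(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2], perl = TRUE)
ascii_only                 # "<aM><eL><iA>"
```

Comparing the result for `txt[1]` against `txt[2]` in this way is one way to check whether a given build is affected.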




Re: [Rd] readLines interaction with gsub different in R-dev

2018-02-17 Thread Hugh Parsonage
| Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
| you use wrong, ie isn't R-devel giving the correct answer?

No, I don't think R-devel is correct (or at least consistent with the
documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry,
perl = TRUE) is "Take every word character and replace it with itself,
converted to uppercase."

Perhaps my example was too minimal. Consider the following:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"

R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie"   # OK, but very different to 'A', despite only not specifying uppercase

R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE"  # OK, but very different to 'A'

R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
[1] "AUTHOR"  # Where did everything after the first group go?

I should note the following example too:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AMéLIE"  # latin1 encoding


A call to `readLines` (or possibly `scan()`, `read.table` and friends)
is essential to reproduce the problem.
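
A minimal sketch of why the `readLines` step matters, assuming an R release in which the problem is fixed: reading with `encoding = "UTF-8"` marks the non-ASCII string as UTF-8, whereas a pure-ASCII literal is never marked. The `\u00e9` escape and the name `upper` are choices made here for portability and illustration. They are not part of the original report.

```r
# Write one UTF-8 line to a temporary file and read it back, marking it UTF-8.
tempf <- tempfile()
writeLines(enc2utf8("author: Am\u00e9lie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
unlink(tempf)

Encoding(entry)              # "UTF-8": readLines marked the string
Encoding("author: Amelie")   # "unknown": pure ASCII is never marked

# Upper-case every word character. The thread reports "AUTHOR: AMÉLIE" on
# R 3.4.3 and "A" on the affected R-devel build.
upper <- gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
upper

# A truncated result is shorter than the input, which makes the bug easy
# to detect without depending on how the non-ASCII character is handled.
nchar(upper) == nchar(entry)
```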







Re: [Rd] readLines interaction with gsub different in R-dev

2018-02-17 Thread Dirk Eddelbuettel

On 17 February 2018 at 21:10, Hugh Parsonage wrote:
| I was told to re-raise this issue with R-dev:
| 
| In the documentation of R-dev and R-3.4.3, under ?gsub
| 
| > replacement
| >... For perl = TRUE only, it can also contain "\U" or "\L" to convert
| > the rest of the replacement to upper or lower case and "\E" to end case
| > conversion.
| 
| However, the following code runs differently:
| 
| tempf <- tempfile()
| writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
| entry <- readLines(tempf, encoding = "UTF-8")
| gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
| 
| 
| "AUTHOR: AMÉLIE"  # R-3.4.3
| 
| "A"  # R-dev

Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
you use wrong, i.e. isn't R-devel giving the correct answer?

R> tempf <- tempfile()
R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
R> entry <- readLines(tempf, encoding = "UTF-8")
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR"
R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR: AMÉLIE"
R> 

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
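
On the question of whether the regexp is wrong, a minimal sketch using a pure-ASCII string (so encoding plays no role) may help. It illustrates that `gsub` applies the replacement to every match of `(\\w)`, not just the first. The variable names are illustrative only.

```r
x <- "author: Amelie"   # pure ASCII, so string encoding plays no role

# gsub() replaces every match; "(\\w)" matches one word character at a time.
all_upper <- gsub("(\\w)", "\\U\\1", x, perl = TRUE)
all_upper      # "AUTHOR: AMELIE"

# sub() replaces only the first match.
first_upper <- sub("(\\w)", "\\U\\1", x, perl = TRUE)
first_upper    # "Author: Amelie"

# Number of single-character matches that gsub() visits.
n_matches <- lengths(gregexpr("\\w", x, perl = TRUE))
n_matches      # 12
```

Under this reading, a result of "A" for the accented input is a truncation and not a consequence of the pattern.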
