Re: [Rd] WISH: Built-in R session-specific universally unique identifier (UUID)

2019-05-20 Thread William Dunlap via R-devel
I think a machine-specific input, like the MAC address, to the UUID is
essential.  S+ used to make a seed for the random number generator based on
the current time and process ID.  A customer complained that all
machines in his cluster generated the same random number stream.  The
machines were rebooted each night, simultaneously, and S+ was started
during the boot process so times and process ids were identical, hence the
seeds were identical.
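
To make the failure mode concrete, here is a minimal sketch; it is not the
actual S+ code.  naive_seed() and host_seed() are hypothetical helpers, and
the weighted character sum is only a toy way of folding a host name into the
seed.  The host name (Sys.info()[["nodename"]] in R) stands in for the MAC
address, which as far as I know base R does not expose directly.

```r
# Hypothetical seed recipe in the spirit of the anecdote: time and PID only.
naive_seed <- function(time, pid) {
  (as.numeric(time) + pid) %% .Machine$integer.max
}

# Same recipe, plus a machine-specific component (toy hash of the host name).
host_seed <- function(time, pid, host) {
  codes <- utf8ToInt(host)
  host_part <- sum(codes * seq_along(codes))
  (as.numeric(time) + pid + host_part) %% .Machine$integer.max
}

# Two nodes booted in the same second, so R starts with the same PID on each.
boot_time <- 1558396800
pid <- 312L

naive_a <- naive_seed(boot_time, pid)
naive_b <- naive_seed(boot_time, pid)
host_a <- host_seed(boot_time, pid, "node01")
host_b <- host_seed(boot_time, pid, "node02")

print(identical(naive_a, naive_b))  # the time+PID seeds collide
print(identical(host_a, host_b))    # the host-aware seeds do not
```

A real implementation would want a proper hash rather than this character
sum, which can collide for different host names.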

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Mon, May 20, 2019 at 4:48 PM Henrik Bengtsson 
wrote:

> # Proposal
>
> Provide a built-in mechanism for obtaining an identifier for the
> current R session, e.g.
>
> > Sys.info()[["session_uuid"]]
> [1] "4258db4d-d4fb-46b3-a214-8c762b99a443"
>
> The identifier should be "unique" in the sense that the probability
> for two R sessions(*) having the same identifier should be extremely
> small.  There's no need for reproducibility, i.e. the algorithm for
> producing the identifier may be changed at any time.
>
> (*) Two R sessions running at different times (seconds, minutes, days,
> years, ...) or on different machines (locally or anywhere in the
> world).
>
>
> # Use cases
>
> In parallel-processing workflows, R objects may be "exported"
> (serialized) to background R processes ("workers") for further
> processing.  In other workflows, objects may be saved to file to be
> reloaded in a future R session.  However, certain types of objects in
> R may only be relevant, or valid, in the R session that created
> them.  Attempts to use them in other R processes may give an obscure
> error or in the worst case produce garbage results.
>
> Having an identifier that is unique to each R process will make it
> possible to detect when an object is used in the wrong context.  This
> can be done by attaching the session identifier to the object.  For
> example,
>
> obj <- 42L
> attr(obj, "owner") <- Sys.info()[["session_uuid"]]
>
> With this, it is easy to validate the "ownership" later;
>
> stopifnot(identical(attr(obj, "owner"), Sys.info()[["session_uuid"]]))
>
> I argue that such an identifier should be part of base R, both for easy
> access and to spare each developer from having to roll their own.
>
>
> # Possible implementation
>
> One proposal would be to bring in Simon Urbanek's 'uuid' package
> (https://cran.r-project.org/package=uuid) into base R.  This package
> provides:
>
> > uuid::UUIDgenerate()
> [1] "b7de6182-c9c1-47a8-b5cd-e5c8307a8efb"
>
> based on Theodore Ts'o's libuuid
> (https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/).  From
> 'man uuid_generate':
>
> "The uuid_generate function creates a new universally unique
> identifier (UUID). The uuid will be generated based on high-quality
> randomness from /dev/urandom, if available. If it is not available,
> then uuid_generate will use an alternative algorithm which uses the
> current time, the local ethernet MAC address (if available), and
> random data generated using a pseudo-random generator.
> [...]
> The UUID is 16 bytes (128 bits) long, which gives approximately
> 3.4x10^38 unique values (there are approximately 10^80 elementary
> particles in the universe according to Carl Sagan's Cosmos). The new
> UUID can reasonably be considered unique among all UUIDs created on
> the local system, and among UUIDs created on other systems in the past
> and in the future."
>
> An alternative that does not require adding a dependency on the
> libuuid library, would be to roll a poor man's version based on a set
> of semi-unique attributes, e.g.
>
> make_id <- function(...) {
>   args <- list(...)
>   saveRDS(args, file = f <- tempfile())
>   on.exit(file.remove(f))
>   unname(tools::md5sum(f))
> }
>
> session_id <- local({
>   id <- NULL
>   function() {
>     if (is.null(id)) {
>       id <<- make_id(
>         info    = Sys.info(),
>         pid     = Sys.getpid(),
>         tempdir = tempdir(),
>         time    = Sys.time(),
>         random  = sample.int(.Machine$integer.max, size = 1L)
>       )
>     }
>     id
>   }
> })
>
> Example:
>
> > session_id()
> [1] "8d00b17384e69e7c9ecee47e0426b2a5"
>
> > session_id()
> [1] "8d00b17384e69e7c9ecee47e0426b2a5"
>
> /Henrik
>
> PS. Having a built-in make_id() function would be handy too, e.g. when
> creating object-specific identifiers for other purposes.
>
> PPS. It would be neat if there was an object, or connection, interface
> for tools::md5sum(), which currently only operates on files sitting on
> the file system. The digest package provides this functionality.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


[Rd] WISH: Built-in R session-specific universally unique identifier (UUID)

2019-05-20 Thread Henrik Bengtsson
# Proposal

Provide a built-in mechanism for obtaining an identifier for the
current R session, e.g.

> Sys.info()[["session_uuid"]]
[1] "4258db4d-d4fb-46b3-a214-8c762b99a443"

The identifier should be "unique" in the sense that the probability
for two R sessions(*) having the same identifier should be extremely
small.  There's no need for reproducibility, i.e. the algorithm for
producing the identifier may be changed at any time.

(*) Two R sessions running at different times (seconds, minutes, days,
years, ...) or on different machines (locally or anywhere in the
world).


# Use cases

In parallel-processing workflows, R objects may be "exported"
(serialized) to background R processes ("workers") for further
processing.  In other workflows, objects may be saved to file to be
reloaded in a future R session.  However, certain types of objects in
R may only be relevant, or valid, in the R session that created
them.  Attempts to use them in other R processes may give an obscure
error or in the worst case produce garbage results.

Having an identifier that is unique to each R process will make it
possible to detect when an object is used in the wrong context.  This
can be done by attaching the session identifier to the object.  For
example,

obj <- 42L
attr(obj, "owner") <- Sys.info()[["session_uuid"]]

With this, it is easy to validate the "ownership" later;

stopifnot(identical(attr(obj, "owner"), Sys.info()[["session_uuid"]]))

I argue that such an identifier should be part of base R, both for easy
access and to spare each developer from having to roll their own.
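
The tag-and-check pattern above can be wrapped in two small helpers.  This
is only a sketch under stated assumptions: Sys.info()[["session_uuid"]] does
not exist in base R today, so current_session_id() is a hypothetical
stand-in that builds a one-off string at load time, and tag_owner() and
assert_owner() are made-up names.

```r
# Hypothetical stand-in for the proposed Sys.info()[["session_uuid"]].
.session_id <- paste(
  Sys.getpid(),
  format(as.numeric(Sys.time()), digits = 15),
  sample.int(.Machine$integer.max, size = 1L),
  sep = "-"
)
current_session_id <- function() .session_id

# Attach the current session's identifier to an object.
tag_owner <- function(obj) {
  attr(obj, "owner") <- current_session_id()
  obj
}

# Fail with a clear message if the object was tagged by another session.
assert_owner <- function(obj) {
  if (!identical(attr(obj, "owner"), current_session_id())) {
    stop("object was created in a different R session", call. = FALSE)
  }
  invisible(obj)
}

obj <- tag_owner(42L)
assert_owner(obj)  # passes silently in the session that created obj

# The attribute is part of the object, so it survives serialization.
roundtrip <- unserialize(serialize(obj, NULL))
assert_owner(roundtrip)

# Simulate an object that arrived from another session.
foreign <- obj
attr(foreign, "owner") <- "some-other-session"
res <- tryCatch(assert_owner(foreign), error = function(e) conditionMessage(e))
print(res)
```

Because the tag is an ordinary attribute, it is exported to workers and
saved to file along with the object, which is what makes the later check
possible.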


# Possible implementation

One proposal would be to bring in Simon Urbanek's 'uuid' package
(https://cran.r-project.org/package=uuid) into base R.  This package
provides:

> uuid::UUIDgenerate()
[1] "b7de6182-c9c1-47a8-b5cd-e5c8307a8efb"

based on Theodore Ts'o's libuuid
(https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/).  From
'man uuid_generate':

"The uuid_generate function creates a new universally unique
identifier (UUID). The uuid will be generated based on high-quality
randomness from /dev/urandom, if available. If it is not available,
then uuid_generate will use an alternative algorithm which uses the
current time, the local ethernet MAC address (if available), and
random data generated using a pseudo-random generator.
[...]
The UUID is 16 bytes (128 bits) long, which gives approximately
3.4x10^38 unique values (there are approximately 10^80 elementary
particles in the universe according to Carl Sagan's Cosmos). The new
UUID can reasonably be considered unique among all UUIDs created on
the local system, and among UUIDs created on other systems in the past
and in the future."
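
The figure quoted from the man page is easy to check in R.  The sketch below
also estimates the chance of any collision using the standard birthday
approximation.  It assumes the random (version 4) UUID layout, where 6 of the
128 bits are fixed and 122 are random; the one billion sessions is an
arbitrary number picked for illustration, not something from the man page.

```r
# Number of distinct 128-bit values; this is the "3.4x10^38" in the man page.
total_values <- 2^128
print(signif(total_values, 2))

# Assumption: version 4 UUIDs, i.e. 122 random bits out of 128.
random_bits <- 122

# Birthday approximation for the probability of at least one collision
# among n identifiers: p ~ n * (n - 1) / (2 * 2^bits), valid while p is small.
collision_prob <- function(n, bits = random_bits) {
  n * (n - 1) / (2 * 2^bits)
}

p <- collision_prob(1e9)  # one billion sessions, purely illustrative
print(p)
```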

An alternative that does not require adding a dependency on the
libuuid library, would be to roll a poor man's version based on a set
of semi-unique attributes, e.g.

make_id <- function(...) {
  args <- list(...)
  saveRDS(args, file = f <- tempfile())
  on.exit(file.remove(f))
  unname(tools::md5sum(f))
}

session_id <- local({
  id <- NULL
  function() {
    if (is.null(id)) {
      id <<- make_id(
        info    = Sys.info(),
        pid     = Sys.getpid(),
        tempdir = tempdir(),
        time    = Sys.time(),
        random  = sample.int(.Machine$integer.max, size = 1L)
      )
    }
    id
  }
})

Example:

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"

/Henrik

PS. Having a built-in make_id() function would be handy too, e.g. when
creating object-specific identifiers for other purposes.

PPS. It would be neat if there was an object, or connection, interface
for tools::md5sum(), which currently only operates on files sitting on
the file system. The digest package provides this functionality.
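
For reference, a file-free variant of make_id() using the digest package
might look like the sketch below.  make_id_mem() is a hypothetical name, the
digest package must be installed, and the resulting hash will generally
differ from the file-based make_id() above, since that one hashes the file
written by saveRDS() rather than the in-memory serialization.

```r
# In-memory variant of make_id(): hash the arguments without a temporary file.
make_id_mem <- function(...) {
  if (!requireNamespace("digest", quietly = TRUE)) {
    stop("the 'digest' package is required for this sketch", call. = FALSE)
  }
  # digest() serializes the object internally and returns a hex string.
  digest::digest(list(...), algo = "md5")
}

if (requireNamespace("digest", quietly = TRUE)) {
  id1 <- make_id_mem(pid = Sys.getpid(), tempdir = tempdir())
  id2 <- make_id_mem(pid = Sys.getpid(), tempdir = tempdir())
  print(id1)
  print(identical(id1, id2))  # identical inputs hash to the same identifier
}
```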