Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-18 Thread Charlie Gao via R-devel
> Date: Wed, 17 Jan 2024 11:35:02 -0500
> From: Dipterix Wang
> To: Lionel Henry, Tomas Kalibera
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] Choices to remove `srcref` (and its buddies) when
>  serializing objects
> Message-ID: <3cf4ca2d-9f72-4c7b-90aa-4d2e9f745...@gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> > On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera wrote:
> > >
> > > I think one could implement hashing on the fly without any
> > > serialization, similarly to how identical works, but I am not aware of
> > > any existing implementation. Again, if that wasn't clear: I don't think
> > > trying to compute a hash of an object from its serialized representation
> > > is a good idea - it is of course convenient, but has problems like the
> > > one you have ran into.
> > >
> > > In some applications it may still be good enough: if by various tweaks,
> > > such as ensuring source references are off in your case, you achieve a
> > > state when false alarms are rare (identical objects have different
> > > hashes), and hence say unnecessary re-computation is rare, maybe it is
> > > good enough.
> >
>
> I really appreciate you answer my questions and solve my puzzles. I went back 
> and read the R internal code for `serialize` and totally agree on this, that 
> serialization is not a good idea for digesting R objects, especially on 
> environments, expressions, and functions. 
> 
> What I want is a function that can produce the same and stable hash for 
> identical objects. However, there is no function (given our best knowledge) 
> on the market that can do this. `digest::digest` and `rlang::hash` are the 
> first functions that come into my mind. Both are widely used, but they use 
> serialize. The author of `digest` said:
> 
> > "As you know, digest takes and (ahem) "digests" what serialize gives it,
> > so you would have to look into what serialize lets you do."
> 
> vctrs:::obj_hash is probably the closest to the implementation of 
> `identical`, but the above examples give different results for identical 
> objects.
> 
> The existence of digest::digest and rlang::hash shows that there is a huge 
> demand for this "ideal" hash function. However, I bet most people are using 
> digest/hash "incorrectly".

Please read the full discussion of this old bug report: 
https://bugs.r-project.org/show_bug.cgi?id=18178

Quoting briefly: Serialization is not intended to be used this way. What 
serialization tries to provide is that x and unserialize(serialize(x, NULL)) 
will be identical() while preserving internal representation where possible. 
Two objects that are considered identical() can have very different internal 
representations, and their serializations will reflect this.
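
As a small base-R illustration of the quoted point (a sketch added for clarity, not taken from the bug report): the round trip preserves identical(), while two objects that are identical() can still serialize to different bytes. The attribute-order case relies on identical()'s default `attrib.as.set = TRUE`; the variable names are only illustrative.

```r
# Round trip: the restored object is identical() to the original.
x <- list(a = 1:3, s = "text", d = c(1.5, NA))
y <- unserialize(serialize(x, NULL))
roundtrip_ok <- identical(x, y)                  # TRUE

# Same attributes, set in a different order.
p <- structure(c(1, 2, 3), first = "a", second = "b")
q <- structure(c(1, 2, 3), second = "b", first = "a")
same_object <- identical(p, q)                   # TRUE: attributes compared as a set
same_bytes <- identical(serialize(p, NULL),
                        serialize(q, NULL))      # FALSE: attribute order is stored
```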

You will see that it is not as simple as just removing the srcref or the 
bytecode from functions. The issue with the `identical()` function in that 
context was eventually patched, but the comment by R-Core that serialization is 
not intended to be used to produce a reliable hash stands. Neither `identical()` 
nor `serialize()` is designed to ensure the same hashable object (in terms of 
bytes).

This is echoed by Tomas' comment above. But we note that it is 'good enough' in 
most cases.

FWIW, `nanonext::sha256()` and family directly hash character strings and raw 
objects, but use the same approach as `digest::digest()` elsewhere. So if 
someone comes up with a canonical binary representation of R objects, they will 
be able to hash it reliably.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-18 Thread Ivan Krylov via R-devel
On Tue, 16 Jan 2024 14:16:19 -0500
Dipterix Wang wrote:

> Could you recommend any packages/functions that compute hash such
> that the source references and sexpinfo_struct are ignored? Basically
> a version of `serialize` that convert R objects to raw without
> storing the ancillary source reference and sexpinfo.

I can show how this can be done, but it's not currently on CRAN or even
a well-defined package API. I have adapted a copy of R's serialize()
[*] with the following changes:

 * Function bytecode and flags are ignored:

f <- function() invisible()
depcache:::hash(f, 2) # This is plain FNV1a-64 of serialize() output
# [1] "9b7a1af5468deba4"
.Call(depcache:::C_hash2, f) # This is the new hash
# [1] 91 5f b8 a1 b0 6b cb 40
f() # called once: function gets the MAYBEJIT_MASK flag
depcache:::hash(f, 2)
# [1] "7d30e05546e7a230"
.Call(depcache:::C_hash2, f)
# [1] 91 5f b8 a1 b0 6b cb 40
f() # called twice: function now has bytecode
depcache:::hash(f, 2)
# [1] "2a2cba4150e722b8"
.Call(depcache:::C_hash2, f)
# [1] 91 5f b8 a1 b0 6b cb 40 # new hash stays the same

 * Source references are ignored:

.Call(depcache:::C_hash2, \( ) invisible( ))
# [1] 91 5f b8 a1 b0 6b cb 40 # compare vs. above

# For quoted function definitions, source references have to be handled
# differently 
.Call(depcache:::C_hash2, quote(function(){}))
# [1] 58 0d 44 8e d4 fd 37 6f
.Call(depcache:::C_hash2, quote(\( ){  }))
# [1] 58 0d 44 8e d4 fd 37 6f

 * ALTREP is ignored:

identical(1:10, 1:10+0L)
# [1] TRUE
identical(serialize(1:10, NULL), serialize(1:10+0L, NULL))
# [1] FALSE
identical(
 .Call(depcache:::C_hash2, 1:10),
 .Call(depcache:::C_hash2, 1:10+0L)
)
# [1] TRUE

 * Strings not marked as bytes are encoded into UTF-8:

identical('\uff', iconv('\uff', 'UTF-8', 'latin1'))
# [1] TRUE
identical(
 serialize('\uff', NULL),
 serialize(iconv('\uff', 'UTF-8', 'latin1'), NULL)
)
# [1] FALSE
identical(
 .Call(depcache:::C_hash2, '\uff'),
 .Call(depcache:::C_hash2, iconv('\uff', 'UTF-8', 'latin1'))
)
# [1] TRUE

 * NaNs with different payloads (except NA_numeric_) are replaced by
   R_NaN.

One of the many downsides to the current approach is that we rely on
the non-API entry point getPRIMNAME() in order to hash builtins.
Looking at the source code for identical() is no help here, because it
uses the private PRIMOFFSET macro.

The bitstream being hashed is also, unfortunately, not exactly
compatible with R serialization format version 2: I had to ignore the
LEVELS of the language objects being hashed both because identical()
seems to ignore those and because I was missing multiple private
definitions (e.g. the MAYBEJIT flag) to handle them properly.

Then there's also the problem of immediate bindings [**]: I've seen bits
of vctrs, rstudio, rlang blow up when calling CAR() on SEXP objects that
are not safe to handle this way, but R_expand_binding_value() (used by
serialize()) is again a private function that is not accessible from
packages. identical() won't help here, because it compares reference
objects (which may or may not contain such immediate bindings) by their
pointer values instead of digging down into them.

Dropping the (already violated) requirement to be compatible with R
serialization bitstream will make it possible to simplify the code
further.

Finally:

a <- new.env()
b <- new.env()
a$x <- b$x <- 42
identical(a, b)
# [1] FALSE
.Call(depcache:::C_hash2, a)
# [1] 44 21 f1 36 5d 92 03 1b
.Call(depcache:::C_hash2, b)
# [1] 44 21 f1 36 5d 92 03 1b

...but that's unavoidable when looking at frozen object contents
instead of their live memory layout.
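
To spell that trade-off out in base R (a sketch only; `env_contents` is a hypothetical helper and not part of depcache): identical() compares environments by reference, whereas a content-based comparison has to look at something like the name-sorted list of bindings.

```r
# Hypothetical helper: bindings of an environment as a list sorted by name,
# so the result does not depend on the internal hash-table order.
env_contents <- function(env) {
  bindings <- as.list(env, all.names = TRUE)
  bindings[order(names(bindings))]
}

a <- new.env()
b <- new.env()
a$x <- 42; a$y <- "z"
b$y <- "z"; b$x <- 42

by_reference <- identical(a, b)                             # FALSE
by_contents <- identical(env_contents(a), env_contents(b))  # TRUE
```

This helper does not descend into nested environments and does nothing about reference cycles, so it is not a substitute for the canonicalization discussed in this thread.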

If you're interested, here's the development version of the package:
install.packages('depcache',contriburl='https://aitap.github.io/Rpackages')

-- 
Best regards,
Ivan

[*]
https://github.com/aitap/depcache/blob/serialize_canonical/src/serialize.c

[**]
https://svn.r-project.org/R/trunk/doc/notes/immbnd.md



Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-17 Thread Dipterix Wang


> 
> We have one in vctrs but it's not exported:
> https://github.com/r-lib/vctrs/blob/main/src/hash.c
> 
> The main use is vectorised hashing:
> 

Thanks for showing me this function. I have read the source code. That's a 
great idea. 

However, I think I might have missed something. When I tried vctrs:::obj_hash, I 
couldn't get identical outputs.


``` r
options(keep.source = TRUE)
a <- function(){}
vctrs:::obj_hash(a)
#> [1] 68 e8 5a 0c
a <- function(){}
vctrs:::obj_hash(a)
#> [1] b2 6a 55 9c
a <-   function(){}
vctrs:::obj_hash(a)
#> [1] 01 a9 bc 30
options(keep.source = FALSE)
a <- function(){}
vctrs:::obj_hash(a)
#> [1] 93 d7 f2 72
a <- function(){}
vctrs:::obj_hash(a)
#> [1] f3 1d d2 f4
```

Created on 2024-01-17 with [reprex v2.1.0](https://reprex.tidyverse.org)

> 
> Best,
> Lionel
> 
> On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
>  wrote:
>> 
>> I think one could implement hashing on the fly without any
>> serialization, similarly to how identical works, but I am not aware of
>> any existing implementation. Again, if that wasn't clear: I don't think
>> trying to compute a hash of an object from its serialized representation
>> is a good idea - it is of course convenient, but has problems like the
>> one you have ran into.
>> 
>> In some applications it may still be good enough: if by various tweaks,
>> such as ensuring source references are off in your case, you achieve a
>> state when false alarms are rare (identical objects have different
>> hashes), and hence say unnecessary re-computation is rare, maybe it is
>> good enough.

I really appreciate you answering my questions and solving my puzzles. I went 
back and read the R internal code for `serialize` and totally agree that 
serialization is not a good idea for digesting R objects, especially 
environments, expressions, and functions. 

What I want is a function that can produce the same, stable hash for 
identical objects. However, there is no function (to the best of our knowledge) 
on the market that can do this. `digest::digest` and `rlang::hash` are the first 
functions that come to my mind. Both are widely used, but they use serialize. 
The author of `digest` said:
> "As you know, digest takes and (ahem) "digests" what serialize gives it,
> so you would have to look into what serialize lets you do."

vctrs:::obj_hash is probably the closest to the implementation of `identical`, 
but the above examples give different results for identical objects.

The existence of digest::digest and rlang::hash shows that there is a huge 
demand for this "ideal" hash function. However, I bet most people are using 
digest/hash "incorrectly".

>> 
>> Tomas
>> 


[[alternative HTML version deleted]]



Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-17 Thread Lionel Henry via R-devel
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation

We have one in vctrs but it's not exported:
https://github.com/r-lib/vctrs/blob/main/src/hash.c

The main use is vectorised hashing:

```
# Non-vectorised
vctrs:::obj_hash(1:10)
#> [1] 1e 77 ce 48

# Vectorised
vctrs:::vec_hash(1L)
#> [1] 70 a2 85 ef
vctrs:::vec_hash(1:2)
#> [1] 70 a2 85 ef bf 3c 2c cf

# vctrs semantics so dfs are vectors of rows
length(vctrs:::vec_hash(mtcars)) / 4
#> [1] 32
nrow(mtcars)
#> [1] 32
```

Best,
Lionel

On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
 wrote:
>
> On 1/16/24 20:16, Dipterix Wang wrote:
> > Could you recommend any packages/functions that compute hash such that
> > the source references and sexpinfo_struct are ignored? Basically a
> > version of `serialize` that convert R objects to raw without storing
> > the ancillary source reference and sexpinfo.
> > I think most people would think of `digest` but that package uses
> > `serialize` (see discussion
> > https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)
>
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation. Again, if that wasn't clear: I don't think
> trying to compute a hash of an object from its serialized representation
> is a good idea - it is of course convenient, but has problems like the
> one you have ran into.
>
> In some applications it may still be good enough: if by various tweaks,
> such as ensuring source references are off in your case, you achieve a
> state when false alarms are rare (identical objects have different
> hashes), and hence say unnecessary re-computation is rare, maybe it is
> good enough.
>
> Tomas
>
> >
> >> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
> >>  wrote:
> >>
> >>
> >> On 1/12/24 06:11, Dipterix Wang wrote:
> >>> Dear R devs,
> >>>
> >>> I was digging into a package issue today when I realized R serialize
> >>> function not always generate the same results on equivalent objects
> >>> when users choose to run differently. For example, the following code
> >>>
> >>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
> >>>
> >>> generates different results when I copy-paste into console vs when I
> >>> use ctrl+shift+enter to source the file in RStudio.
> >>>
> >>> With a deeper inspect into the cause, I found that function and
> >>> language get source reference when getOption("keep.source") is TRUE.
> >>> This means the source reference will make the functions different
> >>> while in most cases, whether keeping function source might not
> >>> impact how a function behaves.
> >>>
> >>> While it's OK that function serialize generates different results,
> >>> functions such as `rlang::hash` and `digest::digest`, which depend
> >>> on `serialize` might eventually deliver false positives on same
> >>> inputs. I've checked source code in digest package hoping to get
> >>> around this issue (for example serialize(..., refhook = ...)).
> >>> However, my workaround did not work. It seems that the markers to
> >>> the objects are different even if I used `refhook` to force srcref
> >>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`.
> >>> None of them works directly on nested environments with multiple
> >>> functions.
> >>>
> >>> I wonder how hard it would be to have options to discard source when
> >>> serializing R objects?
> >>>
> >>> Currently my analyses heavily depend on digest function to generate
> >>> file caches and automatically schedule pipelines (to update cache)
> >>> when changes are detected. The pipelines save the hashes of source
> >>> code, inputs, and outputs together so other people can easily verify
> >>> the calculation without accessing the original data (which could be
> >>> sensitive), or running hour-long analyses, or having to buy servers.
> >>> All of these require `serialize` to produce the same results
> >>> regardless of how users choose to run the code.
> >>>
> >>> It would be great if this feature could be in the future R. Other
> >>> pipeline packages such as `targets` and `drake` can also benefit
> >>> from it.
> >>
> >> I don't think such functionality would belong to serialize(). This
> >> function is not meant to produce stable results based on the input,
> >> the serialized representation may even differ based on properties not
> >> seen by users.
> >>
> >> I think an option to ignore source code would belong to a function
> >> that computes the hash, as other options of identical().
> >>
> >> Tomas
> >>
> >>
> >>> Thanks,
> >>>
> >>> - Dipterix

Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-17 Thread Tomas Kalibera

On 1/16/24 20:16, Dipterix Wang wrote:
> Could you recommend any packages/functions that compute hash such that 
> the source references and sexpinfo_struct are ignored? Basically a 
> version of `serialize` that convert R objects to raw without storing 
> the ancillary source reference and sexpinfo.
> I think most people would think of `digest` but that package uses 
> `serialize` (see discussion 
> https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)


I think one could implement hashing on the fly without any 
serialization, similarly to how identical works, but I am not aware of 
any existing implementation. Again, if that wasn't clear: I don't think 
trying to compute a hash of an object from its serialized representation 
is a good idea - it is of course convenient, but has problems like the 
one you have run into.


In some applications it may still be good enough: if by various tweaks, 
such as ensuring source references are off in your case, you achieve a 
state when false alarms are rare (identical objects have different 
hashes), and hence say unnecessary re-computation is rare, maybe it is 
good enough.


Tomas



>> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera wrote:
>>
>> On 1/12/24 06:11, Dipterix Wang wrote:
>>> Dear R devs,
>>>
>>> I was digging into a package issue today when I realized R serialize 
>>> function not always generate the same results on equivalent objects 
>>> when users choose to run differently. For example, the following code
>>>
>>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>>>
>>> generates different results when I copy-paste into console vs when I 
>>> use ctrl+shift+enter to source the file in RStudio.
>>>
>>> With a deeper inspect into the cause, I found that function and 
>>> language get source reference when getOption("keep.source") is TRUE. 
>>> This means the source reference will make the functions different 
>>> while in most cases, whether keeping function source might not 
>>> impact how a function behaves.
>>>
>>> While it's OK that function serialize generates different results, 
>>> functions such as `rlang::hash` and `digest::digest`, which depend 
>>> on `serialize` might eventually deliver false positives on same 
>>> inputs. I've checked source code in digest package hoping to get 
>>> around this issue (for example serialize(..., refhook = ...)). 
>>> However, my workaround did not work. It seems that the markers to 
>>> the objects are different even if I used `refhook` to force srcref 
>>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`. 
>>> None of them works directly on nested environments with multiple 
>>> functions.
>>>
>>> I wonder how hard it would be to have options to discard source when 
>>> serializing R objects?
>>>
>>> Currently my analyses heavily depend on digest function to generate 
>>> file caches and automatically schedule pipelines (to update cache) 
>>> when changes are detected. The pipelines save the hashes of source 
>>> code, inputs, and outputs together so other people can easily verify 
>>> the calculation without accessing the original data (which could be 
>>> sensitive), or running hour-long analyses, or having to buy servers. 
>>> All of these require `serialize` to produce the same results 
>>> regardless of how users choose to run the code.
>>>
>>> It would be great if this feature could be in the future R. Other 
>>> pipeline packages such as `targets` and `drake` can also benefit 
>>> from it.
>>
>> I don't think such functionality would belong to serialize(). This 
>> function is not meant to produce stable results based on the input, 
>> the serialized representation may even differ based on properties not 
>> seen by users.
>>
>> I think an option to ignore source code would belong to a function 
>> that computes the hash, as other options of identical().
>>
>> Tomas
>>
>>> Thanks,
>>>
>>> - Dipterix


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-16 Thread Dipterix Wang
Could you recommend any packages/functions that compute a hash such that the 
source references and sexpinfo_struct are ignored? Basically a version of 
`serialize` that converts R objects to raw without storing the ancillary source 
reference and sexpinfo.

I think most people would think of `digest`, but that package uses `serialize` 
(see discussion 
https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)

> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera  wrote:
> 
> 
> On 1/12/24 06:11, Dipterix Wang wrote:
>> Dear R devs,
>> 
>> I was digging into a package issue today when I realized R serialize 
>> function not always generate the same results on equivalent objects when 
>> users choose to run differently. For example, the following code
>> 
>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>> 
>> generates different results when I copy-paste into console vs when I use 
>> ctrl+shift+enter to source the file in RStudio.
>> 
>> With a deeper inspect into the cause, I found that function and language get 
>> source reference when getOption("keep.source") is TRUE. This means the 
>> source reference will make the functions different while in most cases, 
>> whether keeping function source might not impact how a function behaves.
>> 
>> While it's OK that function serialize generates different results, functions 
>> such as `rlang::hash` and `digest::digest`, which depend on `serialize` 
>> might eventually deliver false positives on same inputs. I've checked source 
>> code in digest package hoping to get around this issue (for example 
>> serialize(..., refhook = ...)). However, my workaround did not work. It 
>> seems that the markers to the objects are different even if I used `refhook` 
>> to force srcref to be the same. I also tried `removeSource` and 
>> `rlang::zap_srcref`. None of them works directly on nested environments with 
>> multiple functions.
>> 
>> I wonder how hard it would be to have options to discard source when 
>> serializing R objects?
>> 
>> Currently my analyses heavily depend on digest function to generate file 
>> caches and automatically schedule pipelines (to update cache) when changes 
>> are detected. The pipelines save the hashes of source code, inputs, and 
>> outputs together so other people can easily verify the calculation without 
>> accessing the original data (which could be sensitive), or running hour-long 
>> analyses, or having to buy servers. All of these require `serialize` to 
>> produce the same results regardless of how users choose to run the code.
>> 
>> It would be great if this feature could be in the future R. Other pipeline 
>> packages such as `targets` and `drake` can also benefit from it.
> 
> I don't think such functionality would belong to serialize(). This function 
> is not meant to produce stable results based on the input, the serialized 
> representation may even differ based on properties not seen by users.
> 
> I think an option to ignore source code would belong to a function that 
> computes the hash, as other options of identical().
> 
> Tomas
> 
> 
>> Thanks,
>> 
>> - Dipterix




Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-12 Thread Tomas Kalibera



On 1/12/24 06:11, Dipterix Wang wrote:
> Dear R devs,
>
> I was digging into a package issue today when I realized R serialize function 
> not always generate the same results on equivalent objects when users choose to 
> run differently. For example, the following code
>
> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>
> generates different results when I copy-paste into console vs when I use 
> ctrl+shift+enter to source the file in RStudio.
>
> With a deeper inspect into the cause, I found that function and language get source 
> reference when getOption("keep.source") is TRUE. This means the source 
> reference will make the functions different while in most cases, whether keeping function 
> source might not impact how a function behaves.
>
> While it's OK that function serialize generates different results, functions 
> such as `rlang::hash` and `digest::digest`, which depend on `serialize` might 
> eventually deliver false positives on same inputs. I've checked source code in 
> digest package hoping to get around this issue (for example serialize(..., 
> refhook = ...)). However, my workaround did not work. It seems that the markers 
> to the objects are different even if I used `refhook` to force srcref to be the 
> same. I also tried `removeSource` and `rlang::zap_srcref`. None of them works 
> directly on nested environments with multiple functions.
>
> I wonder how hard it would be to have options to discard source when 
> serializing R objects?
>
> Currently my analyses heavily depend on digest function to generate file caches 
> and automatically schedule pipelines (to update cache) when changes are 
> detected. The pipelines save the hashes of source code, inputs, and outputs 
> together so other people can easily verify the calculation without accessing 
> the original data (which could be sensitive), or running hour-long analyses, or 
> having to buy servers. All of these require `serialize` to produce the same 
> results regardless of how users choose to run the code.
>
> It would be great if this feature could be in the future R. Other pipeline 
> packages such as `targets` and `drake` can also benefit from it.


I don't think such functionality would belong to serialize(). This 
function is not meant to produce stable results based on the input; the 
serialized representation may even differ based on properties not seen 
by users.


I think an option to ignore source code would belong to a function that 
computes the hash, similar to the existing options of identical().
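
For reference, identical() already exposes such a switch. The following is a base-R sketch with illustrative names (`f_src`, `f_nosrc`): the same function parsed with and without source references compares equal by default and differs only when `ignore.srcref = FALSE` is requested.

```r
code <- "function(x) x + 1"
f_src <- eval(parse(text = code, keep.source = TRUE))     # carries a srcref
f_nosrc <- eval(parse(text = code, keep.source = FALSE))  # no srcref

default_cmp <- identical(f_src, f_nosrc)                        # TRUE
strict_cmp <- identical(f_src, f_nosrc, ignore.srcref = FALSE)  # FALSE
```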


Tomas



> Thanks,
>
> - Dipterix


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-12 Thread Ivan Krylov via R-devel
On Fri, 12 Jan 2024 00:11:45 -0500
Dipterix Wang wrote:

> I wonder how hard it would be to have options to discard source when
> serializing R objects? 

> Currently my analyses heavily depend on digest function to generate
> file caches and automatically schedule pipelines (to update cache)
> when changes are detected.

Source references may be the main problem here, but not the only one.
There are also string encodings and function bytecode (which may or may
not be present and probably changes between R versions). I've been
collecting the ways that the objects that are identical() to each other
can serialize() differently in my package 'depcache'; I'm sure I missed
a few.

Admittedly, string encodings are less important nowadays (except on
older Windows and weirdly set up Unix-like systems). Thankfully, the
digest package already knows to skip the serialization header (which
contains the current version of R).

serialize() only knows about basic types [*], and source references are
implemented on top of these as objects of class 'srcref'. Sometimes
they are attached as attributes to other objects, other times (e.g. in
quote(function(){}), [**]) just sitting there as arguments to a call.

Sometimes you can hash the output of deparse(x) instead of serialize(x)
[***]. Text representations aren't without their own problems (e.g.
IEEE floating-point numbers not being representable as decimal
fractions), but at least deparsing both ignores the source references
and punts the encoding problem to the abstraction layer above it:
deparse() is the same for both '\uff' and iconv('\uff', 'UTF-8',
'latin1'): just "ÿ".
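
A rough sketch of that deparse-based approach, using only base R and tools::md5sum() on a temporary file (`hash_deparse` is a hypothetical helper, not the code from the linked post):

```r
# Hypothetical helper: hash the deparsed text of an object instead of its
# serialized bytes.
hash_deparse <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # Convert to UTF-8 and write the bytes as-is before hashing the file.
  writeLines(enc2utf8(deparse(x)), tmp, useBytes = TRUE)
  unname(tools::md5sum(tmp))
}

# Same function, different whitespace, with and without source references.
f1 <- eval(parse(text = "function(x) x + 1", keep.source = TRUE))
f2 <- eval(parse(text = "function( x )   x+1", keep.source = FALSE))
same_hash <- hash_deparse(f1) == hash_deparse(f2)   # TRUE
```

Note that deparse() says nothing about a closure's environment, so two functions with the same code but different enclosing environments get the same hash here.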

Unfortunately, this doesn't solve the environment problem. For these,
you really need a way to canonicalize the reference-semantics objects
before serializing them without changing the originals, even in cases
like a <- new.env(); b <- new.env(); a$x <- b; b$x <- a. I'm not sure
that reference hooks can help with that. In order to implement it
properly, the fixup process will have to rely on global state and keep
weak references to the environments it visits and creates shadow copies
of.

I think it's not impossible to implement
serialize_to_canonical_representation() for an R package, but it will
be a lot of work to decide which parts are canonical and which should
be discarded.

-- 
Best regards,
Ivan

[*]
https://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats

[**]
https://bugs.r-project.org/show_bug.cgi?id=18638

[***]
https://stat.ethz.ch/pipermail/r-devel/2023-March/082505.html



[Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-11 Thread Dipterix Wang
Dear R devs,

I was digging into a package issue today when I realized that R's `serialize` 
function does not always generate the same results for equivalent objects, 
depending on how users choose to run the code. For example, the following code

serialize(with(new.env(), { function(){} }), NULL, TRUE)

generates different results when I copy-paste into console vs when I use 
ctrl+shift+enter to source the file in RStudio. 

Looking deeper into the cause, I found that functions and language objects get 
source references when getOption("keep.source") is TRUE. This means the source 
references make the functions differ, even though in most cases keeping the 
function source does not affect how a function behaves.
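
A minimal base-R sketch of this effect (names are illustrative; toggling `keep.source` in parse() stands in for the console versus source-file difference, which is an assumption about how the two run modes differ):

```r
code <- "function(x) x + 1"
f_src <- eval(parse(text = code, keep.source = TRUE))
f_nosrc <- eval(parse(text = code, keep.source = FALSE))

has_srcref <- !is.null(attr(f_src, "srcref"))       # TRUE
same_bytes <- identical(serialize(f_src, NULL),
                        serialize(f_nosrc, NULL))   # FALSE

# For a single top-level function, removeSource() strips the reference again.
f_clean <- utils::removeSource(f_src)
srcref_gone <- is.null(attr(f_clean, "srcref"))     # TRUE
```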

While it's OK that `serialize` generates different results, functions such as 
`rlang::hash` and `digest::digest`, which depend on `serialize`, might end up 
reporting false positives on the same inputs. I've checked the source code of 
the digest package hoping to get around this issue (for example with 
serialize(..., refhook = ...)). However, my workaround did not work. It seems 
that the markers of the objects are different even when I used `refhook` to 
force the srcref to be the same. I also tried `removeSource` and 
`rlang::zap_srcref`. Neither of them works directly on nested environments with 
multiple functions. 

I wonder how hard it would be to have an option to discard source references 
when serializing R objects? 

Currently my analyses depend heavily on the digest function to generate file 
caches and automatically schedule pipelines (to update the cache) when changes 
are detected. The pipelines save the hashes of source code, inputs, and outputs 
together so other people can easily verify the calculation without accessing 
the original data (which could be sensitive), running hour-long analyses, or 
having to buy servers. All of these require `serialize` to produce the same 
results regardless of how users choose to run the code.

It would be great if this feature could be in a future version of R. Other 
pipeline packages such as `targets` and `drake` could also benefit from it.

Thanks,

- Dipterix