Re: [Rd] Making headers self-contained for static analysis

2023-03-16 Thread Lionel Henry via R-devel
People have let me know that the attachment didn't make it through.
Do patches get filtered out?

Please find it there:
https://github.com/lionel-/r-svn/commit/e3de56798b1321a3fa8688a42bbb73d763b78024.patch

I'm also happy to post it on the bugzilla if that makes sense.

Best,
Lionel

On 3/16/23, Lionel Henry  wrote:
> Hello,
>
> I started using clangd to get better static analysis and code
> refactoring tooling with the R sources (using eglot-mode in Emacs, it
> just works once you've generated a `compile_commands.json` file with
> `bear make all`). I noticed that the static analyser can't understand
> several header files because these are not self-contained. So I went
> through all .h files and inserted the missing includes, cf the
> attached patch.
>
> Making the headers self-contained has consequences for the .c or .cpp
> files that include them. In the case of C files, the only downside I
> see is that it might cause users to accidentally rely on indirect
> inclusion of standard headers, instead of directly including the
> header to make the dependency explicit as would be good practice.
> This doesn't seem like a big deal compared to the benefits of enabling
> static analysis.
>
> However in the case of C++ that's more problematic. We don't want to
> include the C headers because that would pollute the global namespace
> and users might prefer to import the missing symbols (`size_t` and
> `FILE`) selectively. Also that wouldn't help static analysis within
> the header files since the analysers use the C path. So I have guarded
> inclusion of standard C headers behing a `__cplusplus` check.
>
> If that makes sense, would R core consider applying the attached patch
> to the R sources?
>
> Best,
> Lionel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Making headers self-contained for static analysis

2023-03-16 Thread Lionel Henry via R-devel
Hello,

I started using clangd to get better static analysis and code
refactoring tooling with the R sources (using eglot-mode in Emacs, it
just works once you've generated a `compile_commands.json` file with
`bear make all`). I noticed that the static analyser can't understand
several header files because these are not self-contained. So I went
through all .h files and inserted the missing includes, cf the
attached patch.

Making the headers self-contained has consequences for the .c or .cpp
files that include them. In the case of C files, the only downside I
see is that it might cause users to accidentally rely on indirect
inclusion of standard headers, instead of directly including the
header to make the dependency explicit as would be good practice.
This doesn't seem like a big deal compared to the benefits of enabling
static analysis.

However in the case of C++ that's more problematic. We don't want to
include the C headers because that would pollute the global namespace
and users might prefer to import the missing symbols (`size_t` and
`FILE`) selectively. Also that wouldn't help static analysis within
the header files since the analysers use the C path. So I have guarded
inclusion of standard C headers behing a `__cplusplus` check.

If that makes sense, would R core consider applying the attached patch
to the R sources?

Best,
Lionel
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] isNamespaceLoaded() while the namespace is loading

2023-03-16 Thread Lionel Henry via R-devel
Hello,

We've run into this issue multiple times and it's often a head
scratcher when it happens. We are using workarounds but it would be
great to fix this for R 4.3. Would an R core member have time to
review the patch that we supplied in
https://bugs.r-project.org/show_bug.cgi?id=18489 ?

Best,
Lionel


On 1/21/22, Gábor Csárdi  wrote:
> We ran into a bug in our package that seems to boil down to
> isNamespaceLoaded() returning TRUE for namespaces that R is currently
> loading.
>
> We had something like this in an .onLoad function:
>
> if (isNamespaceLoaded("upstream")) {
>   upstream::upstream_function()
> }
>
> Which seems OK, unless upstream (recursively) imports the package
> having this code. Should that happen, the loading of upstream triggers
> the loading of this package as well and isNamespaceLoaded() seems to
> return TRUE, even though the namespace of upstream is not fully loaded
> yet, and we get an error that looks like this:
>
> Error : .onLoad failed in loadNamespace() for 'foo', details:
>  call: NULL
>  error: 'upstream_function' is not an exported object from
> 'namespace:upstream'
>
> I wonder if isNamespaceLoaded() returning TRUE is correct in this
> case, or returning FALSE would be better. Or maybe it would make sense
> to have a way to query the packages that are being loaded currently?
>
> AFAICT this works, but it does use some implementation details from
> loadNamespace(), so it does not seem like a proper solution:
>
> dynGet("__NameSpacesLoading__", NULL)
>
> Another workaround is something like this:
>
>   is_loaded <- function(pkg) {
> if (!isNamespaceLoaded(pkg)) return(FALSE)
> tryCatch({
>   loadNamespace(pkg)
>   TRUE
> }, error = function(err) FALSE)
>   }
>
> which forces an error for currently loading namespaces by triggering a
> (fake) recursive dependency.
>
> Thanks,
> Gabor
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] removeSource() vs. function literals

2023-03-31 Thread Lionel Henry via R-devel
If you can afford a dependency on rlang, `rlang::zap_srcref()` deals
with this. It's recursive over expression vectors, calls (including
calls to `function` and their hidden srcref arg), and function
objects. It's implemented in C for efficiency as we found it to be a
bottleneck in some applications (IIRC caching). I'd be happy to
upstream this in base if R core is interested.

Best,
Lionel


On 3/30/23, Duncan Murdoch  wrote:
> On 30/03/2023 10:32 a.m., Ivan Krylov wrote:
>> Dear R-devel,
>>
>> In a package of mine, I use removeSource on expression objects in order
>> to make expressions that are semantically the same serialize to the
>> same byte sequences:
>> https://github.com/cran/depcache/blob/854d68a/R/fixup.R#L8-L34
>>
>> Today I learned that expressions containing function definitions also
>> contain the source references for the functions, not as an attribute,
>> but as a separate argument to the `function` call:
>>
>> str(quote(function() NULL)[[4]])
>> # 'srcref' int [1:8] 1 11 1 25 11 25 1 1
>> # - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'
>> #   
>>
>> This means that removeSource() on an expression that would define a
>> function when evaluated doesn't actually remove the source reference
>> from the object.
>>
>> Do you think it would be appropriate to teach removeSource() to remove
>> such source references? What could be a good way to implement that?
>> if (is.call(fn) && identical(fn[[1]], 'function')) fn[[4]] <- NULL
>> sounds too arbitrary. if (inherits(fn, 'srcref')) return(NULL) sounds
>> too broad.
>>
>
> I don't think there's a simple way to do that.  Functions can define
> functions within themselves.  If you're talking about code that was
> constructed by messing with language objects, it could contain both
> function objects and calls to `function` to construct them.  You'd need
> to recurse through all expressions in the object.  Some of those
> expressions might be environments, so your changes could leak out of the
> function you're working on.
>
> Things are simpler if you know the expression is the unmodified result
> of parsing source code, but if you know that, wouldn't you usually be
> able to control things by setting keep.source = FALSE?
>
> Maybe a workable solution is something like parse(deparse(expr, control
> = "exact"), keep.source = FALSE).  Wouldn't work on environments or
> various exotic types, but would probably warn you if it wasn't working.
>
> Duncan Murdoch
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-17 Thread Lionel Henry via R-devel
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation

We have one in vctrs but it's not exported:
https://github.com/r-lib/vctrs/blob/main/src/hash.c

The main use is vectorised hashing:

```
# Non-vectorised
vctrs:::obj_hash(1:10)
#> [1] 1e 77 ce 48

# Vectorised
vctrs:::vec_hash(1L)
#> [1] 70 a2 85 ef
vctrs:::vec_hash(1:2)
#> [1] 70 a2 85 ef bf 3c 2c cf

# vctrs semantics so dfs are vectors of rows
length(vctrs:::vec_hash(mtcars)) / 4
#> [1] 32
nrow(mtcars)
#> [1] 32
```

Best,
Lionel

On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
 wrote:
>
> On 1/16/24 20:16, Dipterix Wang wrote:
> > Could you recommend any packages/functions that compute hash such that
> > the source references and sexpinfo_struct are ignored? Basically a
> > version of `serialize` that convert R objects to raw without storing
> > the ancillary source reference and sexpinfo.
> > I think most people would think of `digest` but that package uses
> > `serialize` (see discussion
> > https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)
>
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation. Again, if that wasn't clear: I don't think
> trying to compute a hash of an object from its serialized representation
> is a good idea - it is of course convenient, but has problems like the
> one you have ran into.
>
> In some applications it may still be good enough: if by various tweaks,
> such as ensuring source references are off in your case, you achieve a
> state when false alarms are rare (identical objects have different
> hashes), and hence say unnecessary re-computation is rare, maybe it is
> good enough.
>
> Tomas
>
> >
> >> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
> >>  wrote:
> >>
> >>
> >> On 1/12/24 06:11, Dipterix Wang wrote:
> >>> Dear R devs,
> >>>
> >>> I was digging into a package issue today when I realized R serialize
> >>> function not always generate the same results on equivalent objects
> >>> when users choose to run differently. For example, the following code
> >>>
> >>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
> >>>
> >>> generates different results when I copy-paste into console vs when I
> >>> use ctrl+shift+enter to source the file in RStudio.
> >>>
> >>> With a deeper inspect into the cause, I found that function and
> >>> language get source reference when getOption("keep.source") is TRUE.
> >>> This means the source reference will make the functions different
> >>> while in most cases, whether keeping function source might not
> >>> impact how a function behaves.
> >>>
> >>> While it's OK that function serialize generates different results,
> >>> functions such as `rlang::hash` and `digest::digest`, which depend
> >>> on `serialize` might eventually deliver false positives on same
> >>> inputs. I've checked source code in digest package hoping to get
> >>> around this issue (for example serialize(..., refhook = ...)).
> >>> However, my workaround did not work. It seems that the markers to
> >>> the objects are different even if I used `refhook` to force srcref
> >>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`.
> >>> None of them works directly on nested environments with multiple
> >>> functions.
> >>>
> >>> I wonder how hard it would be to have options to discard source when
> >>> serializing R objects?
> >>>
> >>> Currently my analyses heavily depend on digest function to generate
> >>> file caches and automatically schedule pipelines (to update cache)
> >>> when changes are detected. The pipelines save the hashes of source
> >>> code, inputs, and outputs together so other people can easily verify
> >>> the calculation without accessing the original data (which could be
> >>> sensitive), or running hour-long analyses, or having to buy servers.
> >>> All of these require `serialize` to produce the same results
> >>> regardless of how users choose to run the code.
> >>>
> >>> It would be great if this feature could be in the future R. Other
> >>> pipeline packages such as `targets` and `drake` can also benefit
> >>> from it.
> >>
> >> I don't think such functionality would belong to serialize(). This
> >> function is not meant to produce stable results based on the input,
> >> the serialized representation may even differ based on properties not
> >> seen by users.
> >>
> >> I think an option to ignore source code would belong to a function
> >> that computes the hash, as other options of identical().
> >>
> >> Tomas
> >>
> >>
> >>> Thanks,
> >>>
> >>> - Dipterix
> >>> [[alternative HTML version deleted]]
> >>>
> >>> __
> >>> R-devel@r-project.orgmailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> __
> R-devel@r-project.org