Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-18 Thread Charlie Gao via R-devel
> Date: Wed, 17 Jan 2024 11:35:02 -0500
> From: Dipterix Wang
> To: Lionel Henry, Tomas Kalibera
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] Choices to remove `srcref` (and its buddies) when
>  serializing objects
> Message-ID: <3cf4ca2d-9f72-4c7b-90aa-4d2e9f745...@gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> > On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera wrote:
> >
> > > I think one could implement hashing on the fly without any
> > > serialization, similarly to how identical works, but I am not aware of
> > > any existing implementation. Again, if that wasn't clear: I don't think
> > > trying to compute a hash of an object from its serialized representation
> > > is a good idea - it is of course convenient, but has problems like the
> > > one you have run into.
> > >
> > > In some applications it may still be good enough: if by various tweaks,
> > > such as ensuring source references are off in your case, you achieve a
> > > state when false alarms are rare (identical objects have different
> > > hashes), and hence say unnecessary re-computation is rare, maybe it is
> > > good enough.
> 
> I really appreciate you answering my questions and solving my puzzles. I went
> back and read the R internal code for `serialize` and totally agree on this:
> serialization is not a good idea for digesting R objects, especially
> environments, expressions, and functions.
>
> What I want is a function that produces the same, stable hash for
> identical objects. However, to the best of our knowledge, there is no function
> on the market that can do this. `digest::digest` and `rlang::hash` are the
> first functions that come to mind. Both are widely used, but they use
> serialize. The author of `digest` said:
>
> > "As you know, digest takes and (ahem) "digests" what serialize gives it,
> > so you would have to look into what serialize lets you do."
>
> `vctrs:::obj_hash` is probably the closest to the implementation of
> `identical`, but the above examples give different results for identical
> objects.
>
> The existence of `digest::digest` and `rlang::hash` shows that there is a huge
> demand for this "ideal" hash function. However, I bet most people are using
> digest/hash "incorrectly".

Please read the full discussion on this old bug report:
https://bugs.r-project.org/show_bug.cgi?id=18178

Quoting briefly: Serialization is not intended to be used this way. What 
serialization tries to provide is that x and unserialize(serialize(x, NULL)) 
will be identical() while preserving internal representation where possible. 
Two objects that are considered identical() can have very different internal 
representations, and their serializations will reflect this.

You will see that it is not as simple as just removing the srcref or the
bytecode from functions. The issue with the `identical()` function in that
context was eventually patched, but the comment by R-Core that serialization is
not intended to be used to produce a reliable hash stands. Neither `identical()`
nor `serialize()` is designed to guarantee the same hashable object (in
terms of bytes).
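
To illustrate the srcref part with base R only, here is a minimal sketch. It
builds the same function twice, once with source references kept and once
without; `f_src` and `f_nosrc` are hypothetical names chosen for this example,
and it does not cover the other cases raised in the bug report.

```r
# Same function text, parsed with and without source references.
f_src   <- eval(parse(text = "function(x) x + 1", keep.source = TRUE))
f_nosrc <- eval(parse(text = "function(x) x + 1", keep.source = FALSE))

# identical() ignores srcrefs by default (ignore.srcref = TRUE),
# so the two closures compare as identical.
same_object <- identical(f_src, f_nosrc)

# serialize() writes the "srcref" attribute into the byte stream, so the
# bytes differ, and so would any hash computed over them.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_nosrc, NULL))

c(same_object = same_object, same_bytes = same_bytes)
```

Here `same_object` is TRUE while `same_bytes` is FALSE, which is exactly the
kind of false alarm a serialization-based hash produces.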

This is echoed by Tomas' comment above. But we note that it is 'good enough' in 
most cases.

Fwiw `nanonext::sha256()` and family hash character strings and raw objects
directly, but use the same approach as `digest::digest()` elsewhere. So if
someone comes up with a canonical binary representation of R objects, these
functions will be able to hash it reliably.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bug report: parLapply with capture.output(type="message") produces an error

2023-10-07 Thread Charlie Gao via R-devel
Hi Travers,

This is an implementation detail for background workers in general, in that 
there must be some robust way for them to exit (either upon a signal from the 
main session, or if the main session ends / socket disconnects). As these are 
background workers, their error messages are usually not seen, and hence it has 
been deemed good enough that they exit in this case through error. However, you 
do see them in your case as you have diverted the message stream as Henrik has 
highlighted. This may be inconvenient, but can safely be ignored.

If, however, clean output is important in your use case, there is a new solution
that has only just become available. This is a direct outcome of the R Project
Sprint in Warwick a month ago: Luke Tierney has opened up the `parallel`
package to allow other packages to provide alternative communications backends.
This is only possible with R-devel, but as of yesterday a new version of the
`mirai` package was released to CRAN that provides one such backend.

You would simply replace your `makeCluster()` call with 
`mirai::make_cluster()`. That’s the only change.

As this is the R-devel mailing list, I will not go into the details of this 
particular implementation, but it seems useful for users of `parallel` to know 
that this is now possible. As author of `mirai`, please reach out directly with 
questions on the package rather than replying on the list.

I just want to highlight one other possibility - if you remove 
`capture.output()` in your evaluation and call `mirai::make_cluster(2, output = 
TRUE)` instead, you will then be able to see all the messages from the 
background workers in your main process. It’s probably not what you’re after, 
but just in case.
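
Separately, the `withCallingHandlers` approach mentioned in the quoted message
collects messages without sinking the message stream at all. A minimal base R
sketch, where `capture_messages` is a hypothetical helper name and not part of
any package:

```r
# Collect the text of every message() signalled while evaluating `expr`,
# without calling sink() or capture.output(type = "message").
capture_messages <- function(expr) {
  msgs <- character()
  withCallingHandlers(
    expr,  # the promise is forced here, inside the handler's scope
    message = function(m) {
      msgs <<- c(msgs, conditionMessage(m))
      invokeRestart("muffleMessage")  # stop the message reaching stderr
    }
  )
  msgs
}

out <- capture_messages({
  message("starting")
  message("done")
})
```

Each element of `out` keeps the trailing newline that `message()` appends. As
the message stream is never diverted, a function written this way should not
interfere with how a background worker reports its own errors.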

Thanks,

Charlie

6 October 2023 at 12:04, Travers Ching  wrote:

> 
> Hi Henrik,
> 
> Thanks for the detailed technical explanation! I ended up using the
> withCallingHandlers solution to achieve what I needed (thanks to stack
> overflow). If this is not technically a bug I think it is unintuitive and
> unexpected behavior from a user perspective. So take this as a feature
> request rather than a bug report.
> 
> The error message at the end of the script doesn't inform the user what
> part of the script is wrong (using sink or capture.output in parallel). It
> is difficult to understand what's going on.
> 
> The "correct" solution using withCallingHandlers is esoteric, and I think
> most users would not code that up naturally, much less understand what it is
> doing. Could capture.output(type = "message") be rewritten using this
> approach?
> 
> Lastly, the help file for stopCluster says
> 
> "the workers will terminate themselves once the socket on which they are
> listening for commands becomes unavailable, which it should if the master R
> session is completed"
> 
> To me, this implies that I shouldn't need to call stopCluster and that the
> workers are automatically stopped at the end. The place where I first saw
> the error was using future_lapply and following the vignette there's no
> call to stopCluster there either.
> 
> Best,
> Travers
> 
> On Thu, Oct 5, 2023 at 6:15 PM Henrik Bengtsson 
> wrote:
> 
> > 
> > This is actually not a bug. If we really want to identify a bug, then
> >  it's actually a bug in your code. We'll get to that at the very end.
> >  Either way, it's an interesting report that reveals a lot of things.
> > 
> >  First, here's a slightly simpler version of your example:
> > 
> >  $ Rscript --vanilla -e 'library(parallel); cl <- makeCluster(1); x <-
> >  clusterEvalQ(cl, { capture.output(NULL, type = "message") })'
> >  Error in unserialize(node$con) : error reading from connection
> >  Calls:  ... doTryCatch -> recvData -> recvData.SOCKnode ->
> >  unserialize
> >  Execution halted
> > 
> >  There are lots of things going on here, but before we get to the
> >  answer, the most important take-home message here is:
> > 
> >  Never ever use capture.output(..., type = "message") in R.
> > 
> >  Second comment is:
> > 
> >  No, really, do not do that!
> > 
> >  Now, towards what is going on in your example. First, I don't think
> >  help("capture.output") is too "kind" here, when it says:
> > 
> >  'Messages sent to stderr() (including those from message, warning and
> >  stop) are captured by type = "message". Note that this can be “unsafe”
> >  and should only be used with care.'
> > 
> >  To understand why you shouldn't do this, you have to know that
> >  capture.output() uses sink() internally, and its help page says:
> > 
> >  "Sink-ing the messages stream should be done only with great care. For
> >  that stream file must be an already open connection, and there is no
> >  stack of connections."
> > 
> >  The "[When] Sink-ing the messages stream ... there is no stack of
> >  connections" is the reason for the problem you're experiencing.
> >  What happens is that, the background workers that you launch with
> >  parallel::makeCluster() will use 

Re: [Rd] Question on non-blocking socket

2023-02-16 Thread Charlie Gao via R-devel
> Date: Wed, 15 Feb 2023 01:24:26 +0100
> From: Ben Engbers 
> To: r-devel@r-project.org
> Subject: [Rd] Question on non-blocking socket
> Message-ID: <68ce63b0-7e91-6372-6926-59f3fcfff...@be-logical.nl>
> Content-Type: text/plain; charset="utf-8"; Format="flowed"
> 
> Hi,
> 
> December 27, 2021 I started a thread asking for help troubleshooting 
> non-blocking sockets.
> While developing the RBaseX client, I had issues with the authentication 
> process. It eventually turned out that a short break had to be inserted 
> in this process between sending the credentials to the server and 
> requesting the status. Tomas Kalibera put me on the right track by 
> drawing my attention to the 'socketSelect' function. I don't know 
> exactly what the purpose of this function is (the function itself is 
> documented, but I can't find any information on the situations in which 
> it should be called), but it sufficed to call this function once 
> between sending and requesting.
> 
> I have two questions.
> The first is where I can find R documentation on proper use of 
> non-blocking sockets and on the proper use of the socketSelect function?
> 
> The second question is more focused on using non-blocking sockets in 
> general. Is it allowed to execute a read and a receive command 
> immediately after each other, or must a short waiting loop be built in?
> I'm asking this because I'm running into the same problems in a C++ 
> project as I did with RBaseX.
> 
> Ben Engbers
> 

Hi Ben,

For an easier experience with sockets, you may wish to have a look at the 
`nanonext` package. This wraps 'NNG' and is generally used for messaging over 
its own protocols (req/rep, pub/sub etc.), although you can also use it for 
HTTP and websockets.

In any case, a low level stream interface allows connecting with arbitrary 
sockets, using something like `s <- stream(dial = "tcp://0.0.0.0:")` and 
substituting in the actual address. This would allow you greater flexibility in 
sending and receiving over the bytestream without worrying so much about order 
and timing as per your current experience.

For example, a common pattern this allows for is doing an async receive `r <- 
recv_aio(s)` before sending a request `send(s, "some request")`, and then 
querying the receive result afterwards at `r$data`.

I won't go into too much detail here, but as it is my own package, please feel 
free to reach out separately via email or github etc.

Thanks,

Charlie



Re: [Rd] R_GetCurrentEnv() not working as intended

2022-11-14 Thread Charlie Gao via R-devel
Hi Lionel,

I had indeed seen your Bugzilla, but must have misread the R source as I 
thought it had already been adopted.

Thanks for sharing the workaround as well, it is interesting. As I can pass in 
`environment()` to my `.Call()`, I suspect there is not much difference given 
the call to `Rf_eval()` at the end of the workaround.

Let's hope your patch gets reviewed and adopted.

Thanks,

Charlie

November 14, 2022 8:55 AM, "Lionel Henry"  wrote:

> Hello,
> 
> This function currently does not work when called from `.Call()`.
> This is reported with a patch at
> https://bugs.r-project.org/show_bug.cgi?id=17839
> 
> In the meantime, you can use this stopgap implementation:
> 
> https://github.com/tidyverse/purrr/blob/55c9a8ab8788d878ce9e8e80b867139e46d15395/src/conditions.c#L6-L34
> 
> Best,
> Lionel
> 
> On 11/13/22, Charlie Gao via R-devel  wrote:
> 
>> Perhaps my original question was too complicated, so I will just ask: is
>> anyone using R_GetCurrentEnv() in their C code? If so, grateful if you could
>> point me to an example where it is working for you.
>> 
>> I have searched Github and only come across a couple of trivial uses as an
>> argument to Rf_eval(), where it probably returns the global environment,
>> with the result being indistinguishable in normal use.
>> 
>> Thanks,
>> 
>> Charlie
>> 
>> October 22, 2022 12:52 AM, "Charlie Gao" 
>> wrote:
>> 
>>> Dear all,
>>> 
>>> I am attempting to use `R_GetCurrentEnv()` to return the current
>>> environment within C code, but it
>>> seems to always return the global environment.
>>> 
>>> Specifically, I would like to use it as an argument to R_NewEnv() so it is
>>> created with the correct
>>> enclosing environment. I also have functions in the environment that
>>> reference symbols in the
>>> closure and I would also like to use `R_GetCurrentEnv()` as an argument to
>>> `SET_CLOENV()`.
>>> 
>>> My workaround at the moment is to pass `environment()` as one of the
>>> arguments to the `.Call()`.
>>> For the actual code I am referring to:
>>> 
>>> https://github.com/shikokuchuo/nanonext/blob/main/src/aio.c#L516-L535
>>> 
>>> where I am currently passing `environment()` as 'clo' whereas ideally I
>>> would be able to use
>>> `R_GetCurrentEnv()` instead.
>>> 
>>> There is an open Bugzilla report from 2020 that says `R_GetCurrentEnv()`
>>> only returns the base
>>> namespace from within a `.Call()`, however I see that the proposed patch
>>> has already been adopted
>>> in the R source.
>>> 
>>> It seems that the function was introduced (fairly) recently in R 3.6,
>>> presumably for such uses. I
>>> would like to know if this is not the case or else confirmation that this
>>> is an outstanding bug.
>>> 
>>> Thanks,
>>> 
>>> Charlie
>> 



Re: [Rd] R_GetCurrentEnv() not working as intended

2022-11-13 Thread Charlie Gao via R-devel
Perhaps my original question was too complicated, so I will just ask: is anyone 
using R_GetCurrentEnv() in their C code? If so, grateful if you could point me 
to an example where it is working for you.

I have searched Github and only come across a couple of trivial uses as an 
argument to Rf_eval(), where it probably returns the global environment, with 
the result being indistinguishable in normal use.

Thanks,

Charlie

October 22, 2022 12:52 AM, "Charlie Gao"  wrote:

> Dear all,
> 
> I am attempting to use `R_GetCurrentEnv()` to return the current environment 
> within C code, but it
> seems to always return the global environment.
> 
> Specifically, I would like to use it as an argument to R_NewEnv() so it is 
> created with the correct
> enclosing environment. I also have functions in the environment that 
> reference symbols in the
> closure and I would also like to use `R_GetCurrentEnv()` as an argument to 
> `SET_CLOENV()`.
> 
> My workaround at the moment is to pass `environment()` as one of the 
> arguments to the `.Call()`.
> For the actual code I am referring to:
> 
> https://github.com/shikokuchuo/nanonext/blob/main/src/aio.c#L516-L535
> 
> where I am currently passing `environment()` as 'clo' whereas ideally I would 
> be able to use
> `R_GetCurrentEnv()` instead.
> 
> There is an open Bugzilla report from 2020 that says `R_GetCurrentEnv()` only 
> returns the base
> namespace from within a `.Call()`, however I see that the proposed patch has 
> already been adopted
> in the R source.
> 
> It seems that the function was introduced (fairly) recently in R 3.6, 
> presumably for such uses. I
> would like to know if this is not the case or else confirmation that this is 
> an outstanding bug.
> 
> Thanks,
> 
> Charlie



[Rd] R_GetCurrentEnv() not working as intended

2022-10-22 Thread Charlie Gao via R-devel

Dear all,

I am attempting to use `R_GetCurrentEnv()` to return the current 
environment within C code, but it seems to always return the global 
environment.


Specifically, I would like to use it as an argument to R_NewEnv() so it 
is created with the correct enclosing environment. I also have functions 
in the environment that reference symbols in the closure and I would 
also like to use `R_GetCurrentEnv()` as an argument to `SET_CLOENV()`.


My workaround at the moment is to pass `environment()` as one of the 
arguments to the `.Call()`. For the actual code I am referring to:


https://github.com/shikokuchuo/nanonext/blob/main/src/aio.c#L516-L535

where I am currently passing `environment()` as 'clo' whereas ideally I 
would be able to use `R_GetCurrentEnv()` instead.
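
To make the workaround concrete, here is a rough R-level sketch of what is
being done with the environment that gets passed in. It is plain R rather than
the actual C code: `make_handle`, `getter_template` and `local_value` are
hypothetical names, and `new.env()` and `environment<-` merely stand in for
`R_NewEnv()` and `SET_CLOENV()`.

```r
# A function defined at top level; its enclosure is replaced below.
getter_template <- function() local_value

make_handle <- function() {
  local_value <- 42
  clo <- environment()           # this call's frame: the value passed as 'clo'
  env <- new.env(parent = clo)   # stands in for R_NewEnv() enclosed by 'clo'
  getter <- getter_template
  environment(getter) <- clo     # stands in for SET_CLOENV(getter, clo)
  assign("getter", getter, envir = env)
  env
}

h <- make_handle()
value <- h$getter()              # finds local_value through 'clo'
shares_enclosure <- identical(parent.env(h), environment(h$getter))
```

With a working `R_GetCurrentEnv()`, the hope is that 'clo' could be obtained
on the C side instead of being supplied by the caller.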


There is an open Bugzilla report from 2020 that says `R_GetCurrentEnv()` 
only returns the base namespace from within a `.Call()`, however I see 
that the proposed patch has already been adopted in the R source.


It seems that the function was introduced (fairly) recently in R 3.6, 
presumably for such uses. I would like to know if this is not the case 
or else confirmation that this is an outstanding bug.


Thanks,

Charlie
