Re: what was VMS do here? (was [perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input)

Craig A. Berry Fri, 02 Mar 2012 09:11:15 -0800

On Mar 2, 2012, at 9:15 AM, Nicholas Clark wrote:

> On Thu, Mar 01, 2012 at 05:17:27PM -0600, Craig A. Berry wrote:
>> 
>> On Mar 1, 2012, at 12:30 PM, Nicholas Clark wrote:
>>> 
>>> Specifically, the code is emulated on "everything else", but intended to
>>> do something real and useful on VMS:
>>> 
>>> #ifdef VMS
>>>   /* VMS wants read instead of fread, because fread doesn't respect */
>>>   /* RMS record boundaries. This is not necessarily a good thing to be */
>>>   /* doing, but we've got no other real choice - except avoid stdio
>>>      as implementation - perhaps write a :vms layer ?
>>>   */
>>>   fd = PerlIO_fileno(fp);
>>>   if (fd != -1) {
>>>     bytesread = PerlLIO_read(fd, buffer, recsize);
>>>   }
>>>   else /* in-memory file from PerlIO::Scalar */
>>> #endif
>> 
>> I don't think this code is as meaningful as it used to be since unix I/O is 
>> the bottom layer for PerlIO now.  Which means that PerlLIO_read and 
>> PerlIO_read (differing only by the "L") are really the same thing, i.e.,  
>> both boil down to read().  I guess we can't simplify this code until and 
>> unless using stdio as the bottom layer is truly deprecated and expunged.
> 
> I don't think you're correct on that one. read() is not stdio. It's (at least
> on Unix) a syscall. fread() is stdio, and loops on read() until it gets enough
> octets.

Yes, I know read() is not stdio and that fread() is.  That was my point.  
Unless I'm really missing something, there is no fread() involved anymore since 
unix is the bottom PerlIO layer.  The comment in the code about avoiding 
fread() on VMS was only relevant when stdio was the bottom layer, and I believe 
it may have been the only layer at all when that comment was written.

PerlIO_read, in the else branch that you snipped:

    {
        bytesread = PerlIO_read(fp, buffer, recsize);
    }

used to be a wrapper around fread() when stdio was the bottom layer, but is now 
a wrapper around read().  So both branches of the if do exactly the same thing: 
they call read().  We could therefore get rid of the VMS-specific code in 
S_sv_gets_read_record, but *only* if we were willing to say that stdio can't 
ever be the bottom layer (as opposed to no longer being the default bottom 
layer).

> So the code for VMS (if I'm following it correctly) is still grabbing
> a single record.

Yes, calling read() grabs a single record.  My point was just that unless we 
configure with -UUSEPERLIO, we'll currently get read() from both branches of 
that if.

> The reason I'm specifically asking "what does a VMS programmer *want*?" is
> because the fixed size records feature was put in for VMS, with non-VMS an
> afterthought. So
> 
> 1) is there a sane VMS native interpretation of "UTF-8 coming from a fixed
>   record file" ?

No.  Over in [perl #100058] I started but never sent a response to David 
Nicol's question that may be relevant here:

On Fri, Oct 14, 2011 at 3:11 PM, David Nicol <davidni...@gmail.com> wrote:
>
> Is anyone here actually shoehorning UTF8 into fixed-length records, using
> any system besides Perl to do it?

I work with record-oriented files, fixed-length and variable-length, almost 
every day and I have done so off and on for many years.  My experience is 
certainly not comprehensive and my memory may be faulty, but there is no 
scenario I can remember or imagine where any character interpretation at all 
(even ASCII) would be imposed on a fixed-length record.  The record may very 
well contain structured data, and some of the fields in it may contain 
character data.  Interpreting that data has nothing to do with processing the 
records and vice versa.

> and only when that's answered is there
> 
> 2) what do we fake on other platforms?

Yeah, now for the hard part.  I'm not much of a language designer and not much 
of a Unicode wonk, but my feeling is that reading a specific number of 
characters in one go when the characters are of variable size is a problem that 
is utterly different from and unrelated to dealing with fixed-length records.  
It is a third way of defining a record that is as different from defining it by 
length or delimiter as those two ways are from each other.
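
A small sketch of why "N characters" is a different kind of record from "N bytes": the number of bytes you need depends on the data itself.  This is only an illustration, it assumes the buffer already holds valid UTF-8, and bytes_for_chars is a hypothetical helper, not anything in Perl.

```c
#include <stddef.h>

/* Return how many bytes of buf[0..len) are occupied by its first
   nchars UTF-8 characters.  Assumes the input is valid UTF-8. */
size_t bytes_for_chars(const unsigned char *buf, size_t len, size_t nchars)
{
    size_t i = 0;
    while (i < len && nchars > 0) {
        i++;                                        /* consume the lead byte */
        while (i < len && (buf[i] & 0xC0) == 0x80)  /* and its continuation bytes */
            i++;
        nchars--;
    }
    return i;
}
```

Four characters of "cafe!" occupy 4 bytes, but four characters of "caf\xC3\xA9!" occupy 5, so the byte length of such a "record" cannot be known before the data has been looked at.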

I understand that something must be done to fix the mayhem that results when 
imposing :utf8 on a byte stream that may get truncated mid-character.  I don't 
know the best way to do that, but I don't think pretending that the byte stream 
is not a byte stream makes sense.
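
For illustration only, this is roughly what "truncated mid-character" looks like at the byte level.  The sketch below is not how Perl handles it and does not propose a fix; it just reports how many bytes are missing from the last UTF-8 character in a fixed-size chunk.  Both function names are hypothetical.

```c
#include <stddef.h>

/* Total length of a UTF-8 sequence given its lead byte,
   or 0 if the byte cannot start a sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;   /* continuation byte or invalid lead byte */
}

/* Number of bytes still needed to complete the last character in
   buf[0..len), or 0 if the chunk ends on a character boundary
   (or the tail is not a recognizable truncated sequence). */
size_t utf8_missing_tail_bytes(const unsigned char *buf, size_t len)
{
    size_t back = 0;
    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0;   /* empty chunk, or nothing but continuation bytes */
    size_t need = utf8_seq_len(buf[len - 1 - back]);
    size_t have = back + 1;
    return need > have ? need - have : 0;
}
```

Reading "caf\xC3\xA9" as 4-byte records leaves a chunk that ends in a lone 0xC3, and the function reports 1 missing byte; any layer that then treats that chunk as complete character data is working with a broken character.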

> [and I think it's also premature to consider whether this needs :utf8 as a
> real layer to implement. I'd like to get a feeling for what the Perl space
> behaviour, if any, should be]
> 
> 
> The possibly useful analogy is "what happens with a :utf8 layer on sysread?"
> which is, well, summed up with:
> 
>           goto more_bytes;
> 
> ie - it's actually a different behaviour. It makes multiple syscalls. Blech.
> 
> [and, thinking about it now, about 14 years later, possibly that non-VMS
> code in sv_gets() should have been using read(), not fread(), so that it
> would be useful on a datagram socket. But that's a bit late to fix]

Again, I'm almost positive we switched that fread() to read() when we switched 
the default bottom layer from stdio to unix.

________________________________________
Craig A. Berry
mailto:craigbe...@mac.com

"... getting out of a sonnet is much more
 difficult than getting in."
                 Brad Leithauser
