On 12/17/2014 02:20 AM, Isaac Dunham wrote:
> On Tue, Dec 16, 2014 at 04:34:47PM -0600, Rob Landley wrote:
>>> tar copy_in_out():
>>>   die on short read, try to avoid but ignore short write (calls writeall())
>>
>> It's in pending because I haven't cleaned it up yet, but should
>> presumably work like cpio.
> 
> I had been thinking about making it abort on short read or write.

Archiving directories that are in use (such as a user's entire home
directory) is a reasonably common use case. Aborting on the first file
that doesn't behave as expected means you have to quiesce the machine to
snapshot its state. You don't have to do that with the busybox or ubuntu
tar implementations.

Doing that during extract is understandable because the data's already
serialized so you can't recover from premature EOF. But doing that on
create is just lazy. (It should exit with an error code at the end, but
it should create the archive in the meantime.)
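
Which is cheap to do with what's already in lib/, something like this
(just a sketch; add_entry() is a made-up name, and perror_msg() is the
existing helper that sets the eventual exit code to 1 if it was still 0):

  #include "toys.h"

  // Sketch: report the failure and remember it in the exit code,
  // but keep archiving everything else in the tree.
  static void add_entry(char *name)
  {
    int fd = open(name, O_RDONLY);

    if (fd == -1) {
      perror_msg("%s", name);  // exit nonzero at the end...
      return;                  // ...but don't abort the archive
    }
    // ... write header and file data, then ...
    close(fd);
  }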

Same for cp and mv and rm and such. Inability to deal properly with a
file in the tree doesn't abort the tree traversal. (Some of this is
explicit in posix. Tar is one posix dropped the ball on, but the right
thing to do is still obvious.)

>>> But ar would seem to want a slightly different approach for some 
>>> functions, which would not be compatible with any of the current 
>>> archivers (ar is the only non-streaming archiver so far):
>>
>> Because I want to eventually implement even things like gene2fs,
>> mkfatfs, and mkisofs as streaming archivers. (I can't do streaming zip
>> decompression because the metadata's at the end and I'd have to buffer
>> the whole archive anyway, but absent a reason like that being able to
>> pipe straight to/from gzip or ssh is worth a little extra effort.)
> 
> ar could theoretically be extracted by a streaming archiver, 
> but it cannot be written by one. There are two special entries that
> always happen at the start: one stores a symbol index, the other
> is a list of long file names.

Yeah, ext2 has this too. The superblock and inode tables and such
contain information about filesystem state that requires reading ahead.
That's why I added DIRTREE_SAVE, which can traverse the directories and
snapshot the state before writing out the metadata:

http://landley.net/notes-2007.html#02-02-2007
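
Roughly this shape, in case it's not obvious from the notes (a sketch
against the current dirtree API; the function names are made up):

  #include "toys.h"

  // Sketch: pass one keeps every node in memory (DIRTREE_SAVE) while
  // recursing, so pass two can emit metadata that depends on totals
  // (inode count, block count) only known after the whole scan.
  static int save_node(struct dirtree *node)
  {
    return DIRTREE_SAVE|DIRTREE_RECURSE;
  }

  static void scan_then_write(char *path)
  {
    struct dirtree *root = dirtree_read(path, save_node);

    // pass two: walk root->child and ->next, write the metadata
  }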

Real life's been getting in the way of doing this stuff for a long
time. I also haven't gone back to gene2fs because it's one of those big
things, like sed/awk/vi/sh, that needs an "eat your brain" month to get
right, and I got interrupted in my earlier pass and need to re-learn
all the details, like:

http://landley.net/notes-2007.html#16-02-2007

> It is also *required* to support insertion at arbitrary locations
> (before or after specified files).

Meaning it reads the old one and writes a new one.

> Besides that, POSIX says "STDIN: Not used."
> Which I'm *quite* happy with.

Posix doesn't mention sed -i either. Posix only mentions tar in the
context of pax (and its version won't work on modern systems; the file
length and symlink behavior is wrong), and it deprecated cpio in 2001
and only ever documented the 6 byte (not 8 byte) version...

>> ssize_t according to man 2 sendfile. I just hadn't yet because nothing
>> was using it. If I expand it, I'd want to move towards convergence with
>> the syscall... except that it gratuitously wants one of its arguments
>> to be
>> a network socket for NO OBVIOUS REASON.
> 
> By "indicate bytes written" I mean "return the total number of 
> bytes written".
> 
> According to my man pages, "In Linux kernels before 2.6.33, out_fd 
> must refer to a socket. Since Linux 2.6.33 it can be any file."

That's still recent enough (2010) that I need a probe, but yeah, we
should use it.
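
The probe can probably just happen at runtime: try it once and fall
back if the kernel says no. A sketch (copy_loop() stands in for
whatever read/write fallback we keep around):

  #include <sys/sendfile.h>
  #include <sys/types.h>
  #include <errno.h>

  // Hypothetical read/write fallback, declared here for the sketch.
  static ssize_t copy_loop(int out, int in, size_t len);

  // Sketch: try file-to-file sendfile(), fall back on kernels before
  // 2.6.33 (which return EINVAL when out_fd isn't a socket) or libcs
  // without the wrapper (ENOSYS).
  static ssize_t do_copy(int out, int in, size_t len)
  {
    ssize_t got = sendfile(out, in, 0, len);

    if (got < 0 && (errno == EINVAL || errno == ENOSYS))
      return copy_loop(out, in, len);

    return got;
  }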

> Using sendfile will of course require a loop if you have a file larger
> than half the address space;

Why? If you enable long file support in libc (hardwired on in musl,
present in 2.4, _not_ enabling that is pilot error) then off_t should
be 64 bit?

> I was hoping for something that won't 
> croak if given a 9-gigabyte file on a 32 bit computer. 
> ar was clearly *intended* to use 32-bit off_t, but the record can store 
> any file size less than 10^10 bytes.

lib/portability.h line 27:

// Always use long file support.
#define _FILE_OFFSET_BITS 64

Limitations of the file format are another matter, but storing a single
.o file larger than 2 gigs should never happen even on a 64 bit system.
(Truncate and error_msg("too long '%s'", filename); error_msg()
automatically sets the error number we exit with to 1 if it was 0, and
doesn't touch it if it was already nonzero, to preserve the specific
error value in case something cares.)

>> When the pipe improvements went into the kernel, they were talking
>> about improving it to work with any two arbitrary filehandles, but I'd
>> need some sort of version probe to see whether I could use it or have to
>> fall back to the C implementation, and it's on the todo list...
>>
>> http://blog.superpat.com/2010/06/01/zero-copy-in-linux-with-sendfile-and-splice/
>>
>> I dunno if the genericization work made it upstream or if it needs
>> splice() still. (There was talk about it on lwn.net at one point...)
>>
>> Dear thunderbird, genericization is so a word. Oh good grief it's got a
>> squiggle under thunderbird. I am not capitalizing the t. Deal with it.
>>
>> Ok, if the above two are separate sentences, full of squiggles. Together
>> as a paragraph: no squiggles. But "Ok" has a squiggle. I'm going to give
>> up trying to understand this email client now.
> 
> Yeah, I gave up on spill chuckers years ago.

Chuckers has a squiggly underline.

> I think the rules are something like "here's a massive list of words;
> anything lowercase that's in it can be capitalized after [?!.\n],
> anything uppercase must be matched exactly, and everything else is
> a mistake".
> Check if you have wamerican-huge installed and the language set to
> some variant of en_US...in all the dozens of places it should be set.

I've stopped caring about their implementation; what I want to know is
why they _attempt_ to do something they so clearly fail at. If you can't
get it even remotely right, don't go there. (To which I expect they
reply "but doubtlook!" and I facepalm and move on.)

>>>  I suppose I could use xsendfile() and then lseek() rather than refactoring
>>>  xsendfile().
>>
>> Is there a reason working like cpio is not an option?
> 
> Because (0) I didn't want to refactor the input loop of cpio (it seems to
> have gotten a bit more daunting...), (1) we don't have to worry about 
> things like passthrough mode, since we *always* are dealing with 
> seekable files,

ar c <(ssh user@addr cat filename.o)

I have done stranger things. On a fairly regular basis, actually.

> (2) and it makes no sense to have ar create even 
> partly corrupt libraries or archives.

Since this archiver has a very dominant primary use, you have a point.
But for things like tar, cpio, mv, cp, rm, sed -i, filesystem
generators... not so much.

That argues _against_ trying to make it use a generic function that
would go the other way.

> (its only common applications are static libraries and debian packages,
> where it should fail hard if it doesn't work perfectly.)
> GNU ar won't leave an archive around if you specify to archive a file 
> that exists and one that doesn't--and that's the right course of action,
> for once.

It would be nice if
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/ar.html
specified this, but yeah you're probably right.

> With an atexit() handler, I could simply use the extended xsendfile().

Right now only cp, tail, and bootchartd use it directly. (Which is odd
because it was created for patch, but that uses replace_tempfile() in
lib/lib.c which calls xsendfile() internally.)

>> You know how much data you're sending to it, right? I can see extending
>> xsendfile with an argument "send this many bytes", with -1 meaning "just
>> keep going until the source runs out". But the x prefix means "die if
>> you can't", and this might need more nuanced error handling than that...
>>
>>> -extract:
>>>   die on short read (corrupt file), die on short write (out of space).
>>
>> Sounds about right.
> 
>>> So I guess the sensible course is to write xcopyall() and make all the
>>> archivers use it where relevant.
>>
>> I lost the thread, you need it to do what?
>>
>> It sounds like you want xsendfile() to take a third argument, a max
>> length to copy with -1 meaning the current "until the end" behavior, and
>> then return the number of bytes copied? (With premature EOF being
>> perror_exit() territory?) Except this implies a non-x sendfile that
>> _won't_ exit for a short read, and that name's taken, but I can come up
>> with something.
> 
> That's what I was wanting, complete with perror_exit on premature EOF;

and cpio extract could probably use it too (line 154-ish), modulo the
strpad() stuff...

> (x)copyall seems to be a reasonable name for it when sendfile is taken.
> 
> I suppose that if xsendfile()/xcopyall with a positive length will
> always copy the full length, the return value can stay as void.

The return value is easy to ignore, and with -1 you might want to know
how much actually got sent.
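
Something like this, maybe (just a sketch; the name xsendfile_len()
isn't committed to anything, it leans on the existing xread()/xwrite()
and the global toybuf):

  #include "toys.h"

  // Sketch: copy len bytes from in to out, or until EOF if len is -1.
  // Read errors, short writes, and premature EOF with an explicit len
  // are all fatal; returns the number of bytes copied.
  long long xsendfile_len(int in, int out, long long len)
  {
    long long total = 0;

    while (len < 0 || total < len) {
      int chunk = sizeof(toybuf), got;

      if (len >= 0 && len-total < chunk) chunk = len-total;
      got = xread(in, toybuf, chunk);  // dies on read error
      if (!got) break;                 // EOF
      xwrite(out, toybuf, got);        // dies on short write
      total += got;
    }
    if (len >= 0 && total != len) error_exit("short file");

    return total;
  }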

> Thanks,
> Isaac Dunham

Rob
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
