Re: EFI in Debian

2012-07-09 Thread Ted Ts'o
On Mon, Jul 09, 2012 at 04:48:38PM +0100, Matthew Garrett wrote:
 In article 20120708235244.gb24...@thunk.org  Ted Ts'o wrote:
  Matthew Garrett believes that this is a requirement; however, there is
  no documented paper trail indicating that this is actually necessary.
  There are those who believe that Microsoft wouldn't dare revoke a
  Linux key because of the antitrust issues that would arise.
 
 Hey, it's hardly my fault that nobody else bothered turning up to the
 well-advertised events where this got discussed...

If it's not documented on paper, it didn't happen.  :-)

Discussions in smoke-filled rooms, even if they are well-advertised,
don't really impress me.  (This isn't your fault, but Microsoft's.)

- Ted





Re: EFI in Debian

2012-07-08 Thread Ted Ts'o
On Sun, Jul 08, 2012 at 10:00:05AM -0600, Paul Wise wrote:
 On Sun, Jul 8, 2012 at 7:15 AM, Wookey wrote:
  Will Android machines make secure boot turn-offable or another key
  installable, or will they follow the Microsoft lead and lock
  everything down too?
 
 Are there any Android devices that aren't *already* bootloader locked
 or require jailbreaking to get root? I don't think Microsoft is
 creating a trend here, locked down ARM devices are already the norm
 AFAICT.

The Galaxy Nexus (and Nexus devices in general) can be unlocked by
simply running the "fastboot oem unlock" command which is distributed
as part of the Android SDK.  The unlock process will erase all of the
user data for security reasons (so that if someone steals your phone,
they can't use the unlock process to break security and grab all of
your data, including silly things like the authentication cookies
which would allow an attacker access to your Google account).

HTC and ASUS have also been selling their newer Android devices with
an unlocked bootloader.  Most Samsung devices are shipped with
unlocking tools, so it came as a bit of a surprise when the Verizon
Samsung Galaxy S3 came with a locked bootloader.  Some have blamed
Verizon, but there's no proof of that as far as I know.

So in answer to your question, there are plenty of Android devices
which are trivially unlockable.  (And once a Nexus phone is unlocked,
you can get a root shell trivially; no jail-breaking necessary.  Of
course this is also true for an attacker/thief who has managed to
steal your phone, but if you want to unlock the phone, it's easily
doable on many Android devices.)

   - Ted

P.S.  Personally, I recommend that people buy SIM-unlocked and easily
boot-unlocked Android phones; and if you get a Google Experience Nexus
that isn't subsidized by carriers, its firmware updates don't have to
get approved by carriers.  It also means you don't get any
carrier-mandated or handset-manufacturer-mandated bloatware.





Re: EFI in Debian

2012-07-08 Thread Ted Ts'o
On Fri, Jul 06, 2012 at 05:32:44AM +0100, Ben Hutchings wrote:
 
 2. Upstream kernel support: when booted in Secure Boot mode, Linux would
 only load signed kernel modules and disable the various debug interfaces
 that allow code injection.  I'm aware that David Howells, Matthew
 Garrett and others are working on this.

Matthew Garrett believes that this is a requirement; however, there is
no documented paper trail indicating that this is actually necessary.
There are those who believe that Microsoft wouldn't dare revoke a
Linux key because of the antitrust issues that would arise.

This would be especially true if the bootloader displayed a splash
screen with a huge penguin on it, and the user was obliged to hit a
key acknowledging the splash screen before the boot was allowed to
continue.  James is working on a signed bootloader which would do
this.

It's not even obvious that the splash screen is needed, BTW.
Canonical is not using a splash screen and is not signing the kernel
or kernel modules.  It will be *very* interesting if Microsoft dares
to revoke Canonical's certificate, or refuses to issue one.  I'm sure
there are developers in Europe who would be delighted to call this to
the attention of the European Anti-Trust regulators --- you know, the
ones who have already fined Microsoft to the tune of 860 million Euros
($1.1 billion USD).

So personally, I would hope that at least some distributions will
patch out the splash screen, and apply for a certificate.  If we have
multiple distributions using different signing policies and slightly
different approaches (which is the beauty of free/open source boot
loaders; everyone can tweak things slightly), we can see how Microsoft
will react.

It should be entertaining.

- Ted





Re: multiarch, required packages, and multiarch-support

2012-06-15 Thread Ted Ts'o
On Thu, Jun 14, 2012 at 09:22:43PM -0700, Russ Allbery wrote:
 Theodore Ts'o ty...@mit.edu writes:
 
  If a required package (such as e2fslibs, which is required by e2fsprogs)
  provides multiarch support, then Lintian requires that the package have
  a dependency on the package multiarch-support[1].
 
  However, this causes debcheck to complain because you now have a
  required package depending on a package, multiarch-support, which is
  only at standard priority[2] 
 
 multiarch-support should be priority: required.  It's already a dependency
 of several other priority: required packages, such as libselinux1 and
 zlib1g.
 
 That implies that in the interim you should ignore debcheck.

Thanks, I've filed a bug against multiarch-support to this effect.


  - Ted





Re: /tmp on multi-FS set-ups, or: block users from using /tmp?

2012-05-26 Thread Ted Ts'o
On Sat, May 26, 2012 at 09:29:30PM +0700, Ivan Shmakov wrote:
   … But that makes me recall a solution to both the /tmp and quota
   issues I've seen somewhere: use ~/tmp/ instead of /tmp.  This
   way, user's temporary files will be subject to exactly the same
   limits as all the other his or her files.
 
   (Still, we may need to identify the software that ignores TMPDIR
   and tries to write to /tmp unconditionally.)
 
   (Snark aside, does tmpfs support quotas yet/will it ever?)

These days I'd argue that multi-user is such a corner case that it's
not worth optimizing for it as far as defaults are concerned.  If
you're trying to run a secure multi-user system, you need to be an
expert system administrator, keep up with all security patches, and
even then, good luck to you.  (The reality is that these days, no
matter what OS you're talking about, shell == root.  And that's
probably even true on the most unusably locked down SELinux system.)

What I'd do in that situation is to use per-user /tmp directories,
where each user would get their own mount namespace, and so each user
would have their own /tmp --- either $HOME/tmp bind-mounted onto /tmp
if you want to enforce quotas that way, or a separate tmpfs for each
user --- and then you can specify the size of the per-user tmpfs
mounted on each user's version of /tmp.
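
For what it's worth, a minimal sketch of the per-user tmpfs variant
(assuming root privileges and a kernel with mount namespace support;
the 512m size cap and doing this at login time --- e.g. from a PAM
session hook, which is roughly what pam_namespace does --- are just
illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

int main(void)
{
    /* Give this process (and its children) a private mount namespace. */
    if (unshare(CLONE_NEWNS) < 0) {
        perror("unshare(CLONE_NEWNS)");
        exit(1);
    }
    /* Keep our mounts from propagating back to the parent namespace. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
        perror("remounting / private");
        exit(1);
    }
    /* A per-user tmpfs with an explicit size cap; bind-mounting
     * $HOME/tmp here instead would enforce the user's disk quota. */
    if (mount("tmpfs", "/tmp", "tmpfs", MS_NOSUID | MS_NODEV,
              "size=512m,mode=1777") < 0) {
        perror("mounting /tmp");
        exit(1);
    }
    /* ... exec the user's login shell here ... */
    return 0;
}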

Cheers,

- Ted





Re: Moving /tmp to tmpfs makes it useless

2012-05-25 Thread Ted Ts'o
On Fri, May 25, 2012 at 11:11:06AM +0200, Salvo Tomaselli wrote:
  Files which are written on a regular filesystem stay on memory. This is
  called the buffer cache. Whenever they are not used and/or the system
  needs to reclaim memory, they are trashed.
  Files which are written on a tmpfs stay on memory. Whenever they are not
  used and/or the system needs to reclaim memory, they are swapped.
  
  See? No difference.
 
 You seem to forget that memory is not an unlimited resource, the
 system might need it for other things, and in that case a large
 tmpfs causes severe slowdown (and even complete freeze).

So what?  If you write to a normal file system, it goes into the page
cache, which is pretty much the same as writing into tmpfs.  In both
cases if you have swap configured, the data will get pushed to disk;
either to the file system or to swap, as memory requirements dictate.

The main advantage of tmpfs is that it gets wiped on reboot, and so it
prevents people and applications from thinking that they can keep
stuff in /tmp forever.  It's also faster because a file system has to
do extra work to make sure the files are preserved after a reboot.

 - Ted





Re: Moving /tmp to tmpfs makes it useful

2012-05-25 Thread Ted Ts'o
On Fri, May 25, 2012 at 02:49:14PM +0100, Will Daniels wrote:
 On 25/05/12 13:52, Ted Ts'o wrote:
 So what?  If you write to a normal file system, it goes into the page
 cache, which is pretty much the same as writing into tmpfs.  In both
 cases if you have swap configured, the data will get pushed to disk;
 
 That's not at all the same, the page cache is more temporary, it's
 getting flushed to disk pretty quick if memory is tight (presumably)
 but in the same situation using tmpfs going to swap is surely going
 to be more disruptive?

In many cases there will be some difference, but really not that
much, between pushing tmpfs pages out to swap and writing back files
to a filesystem (in both cases the data is stored in the page cache,
whether it's a tmpfs file or an ext2/3/4 or xfs or btrfs file).

The major difference is that tmpfs pages only get written out to swap
when the system is under memory pressure.  In contrast, pages which
are backed by a filesystem will start being written to disk after 30
seconds _or_ if the system is under memory pressure.

So if you have a super-fast networking connection, it won't matter;
the download will fill memory very quickly, at which point it will get
written to swap and/or the file's location on disk at the same rate.
Since you'll be able to download faster than you can save to disk, the
network connection will get throttled due to TCP/IP backpressure to
roughly the same rate at which you are writing to the HDD.

If you have a slow networking connection, it's possible that the 30
second writeback timer will kick off before you start filling the
memory --- but in that case, it's not going to be that disruptive in
the tmpfs case, either.  You'll hit memory pressure, and at that point
you'll start writing to disk perhaps a bit later than the 30 second
writeback timer.  But at the same time, the download is coming in
slowly enough that you're probably not overwhelming the speed at which
you can write to the HDD or SSD.

The place where it will make a difference is if you have a very large
amount of memory, *and* you are downloading a really big file to /tmp
(substantially bigger than your physical memory), *and* your effective
end-to-end download speed is faster than your HDD write speed, but
slow enough that it takes substantially longer than 30 seconds to
download enough to fill your free physical memory.  But that's
actually a pretty narrow window.

 And anyway, not everybody uses swap, in which case this default is
 not entirely viable. I, for one, had no idea this had become default
 for Debian and I think it's likely to be one of those things that
 jumps out to bite people who weren't expecting it at some
 inconvenient moment.

Well, it's all about defaults, right?  It's easy enough to set up a
swap partition, or even just a swap file by default in the installer.
You can set up a swap file on the fly, so it's not that hard to deal
with it after the fact.

 I'm sure the project veterans and more attentive readers of this
 list are tired of recurring arguments like this, but usually if
 something is recurring it is for a reason. Given my general no
 swap preference, I'm glad this has come up again so that I'm aware
 of it this time.
 
 The tmpfs setup seems far more appropriate as a performance tweak
 for admins than as a default. Where there is plenty of RAM, buffer
 cache makes the difference largely negligible. But where there isn't
 an abundance of RAM, it could needlessly cause problems (especially
 without swap).

If you're worried about installations which don't have much memory
(i.e., the 512MB netbook), then swap is absolutely mandatory, I would
think!

And if you consider how much memory most desktops/laptops have, and
how often people **really** are downloading multi-gigabyte files to
/tmp (note that browsers tend to default downloads to ~/Downloads), I
think the people who are agitating against tmpfs are making a much
more theoretical argument than one that is likely to bite an
unsophisticated user --- and a sophisticated user can easily decide
whether to use /tmp on disk or not.

 - Ted





Re: udeb and data.tar.xz files?

2012-05-14 Thread Ted Ts'o
On Mon, May 14, 2012 at 09:51:43PM +0200, Niels Thykier wrote:
 
 Lintian is outdated (#664600) and the fix has been commited to the git
 repository[1]

I saw a bug report requesting that packages that failed the lintian
udeb-uses-non-gzip-data-tarball check should be summarily rejected.
Did this actually get implemented on the ftp server?  I.e., do I need
to wait until the new version of lintian gets propagated out to the
Debian servers, or, if I'm in a hurry, should I hack dh_builddeb to
generate a udeb that doesn't use data.tar.xz?

I just want to know what my options are.

Thanks,

- Ted





Re: udeb and data.tar.xz files?

2012-05-14 Thread Ted Ts'o
On Mon, May 14, 2012 at 10:20:08PM +0200, Philipp Kern wrote:
 
 as soon as we get a hold of an ftp-master the autoreject will be dropped.
 We'll certainly don't wait until Lintian is backported. ;-)
 

Great, thanks for the clarification.  I wasn't aware of

http://ftp-master.debian.org/static/lintian.tags

as being where the auto-reject is configured, so I didn't know how to
check on my own, or that it would be really easy to fix.

I'll wait until that gets fixed up and upload my updated packages
then.

Thanks again,

- Ted





Re: Bug#652275: Guided partitioning should not offer separate /usr, /var, and /tmp partitions; leave that to manual partitioning

2011-12-27 Thread Ted Ts'o
On Fri, Dec 16, 2011 at 02:38:11PM +, Lars Wirzenius wrote:
 On Fri, Dec 16, 2011 at 02:13:29PM +0100, Stig Sandbeck Mathisen wrote:
  Simon McVittie s...@debian.org writes:
  
   life's too short to spend time booting in single-user mode and
   resizing LVs.
  
  That's probably why we now have online resizing of LVs and filesystems
 
 resize2fs, at least, only supports online resizing to make the filesystem
 larger, not smaller. It's not particularly useful for, say, the root
 filesystem.

FYI, the resize2fs program does support off-line shrinking of ext3
file systems.  It doesn't currently support off-line resizing of ext4
file systems (just on-line growth), but that's something I consider a
bug that I just haven't had the time to get around to fixing.

Regards,

- Ted





Re: Bug#652275: Guided partitioning should not offer separate /usr, /var, and /tmp partitions; leave that to manual partitioning

2011-12-27 Thread Ted Ts'o
On Wed, Dec 21, 2011 at 02:32:58PM +0100, Goswin von Brederlow wrote:
  If we want to improve fsck time then the best thing to do would be
  to consider a different default value for the -i option of mke2fs.

This advice is not applicable for ext4, since fsck will not read
unused portions of the inode table on ext4.  There have been a number
of improvements in the ext4 file system format which mean that in
general fsck times for ext4 are around 7-12 times faster than for the
equivalent ext3 file system.

  As an aside mke2fs -t ext4 includes huge_file, dir_nlink, and
  extra_isize while mke4fs doesn't.  This difference seems wrong to
  me.
 
 Urgs. +1.

I've never heard of mke4fs --- who thought up that abortion?

mke2fs -t ext4 and mkfs.ext4 will both do the right thing, as far
as creating file systems that have the correct ext4 file system
features for a file system designed to be mounted using the ext4 file
system driver in modern Linux kernels.

- Ted





Re: Could the multiarch wiki page be explicit about pkgconfig files?

2011-09-25 Thread Ted Ts'o
On Mon, Sep 19, 2011 at 10:00:35PM +0200, Josselin Mouette wrote:
 Le lundi 19 septembre 2011 à 18:56 +0100, Simon McVittie a écrit : 
   The correct place for debug files is a hash-based path, instead of the
   crapfuck we have today.
  
  ... but until then, for gdb to pick them up, debug symbols for $THING must 
  be
  in /usr/lib/debug/$THING (a general rule, independent of multiarch),
 
 No, gdb is perfectly able to pick them from /usr/lib/debug/.build-id/.

Is this the right long-term path?  It seems weird that we would be
using a hidden dot-file.  Is there a reason why it isn't
/usr/lib/debug/build-id?

- Ted





Re: Could the multiarch wiki page be explicit about pkgconfig files?

2011-09-19 Thread Ted Ts'o
On Mon, Sep 19, 2011 at 08:51:00AM +0200, Tollef Fog Heen wrote:
 ]] Theodore Ts'o 
 
 | and that's the correct location of pkgconfig files, which currently are
 | stored at /usr/lib/pkgconfig/lib.pc.   The Wiki page seems to imply
 | the correct location is /usr/lib/triplet/pkgconfig/lib.pc.   And
 | I've received patches that drop them there.
 
 /usr/lib/triplet/pkgconfig/lib.pc is correct.
 
 pkg-config 0.26-1 and newer looks there by default, so you probably want
 to either depend on that version or newer or add a Breaks against older
 versions.

OK, how about /usr/lib/triplet/debug/sbin/e2fsck?

I just checked and gdb doesn't find the debugging symbols if I drop the
debug files under /usr/lib/triplet.  What is the planned correct
place for the debug files?

Thanks,

- Ted





Re: Bug#616317: base: commit= ext3 mount option in fstab has no effect.

2011-05-07 Thread Ted Ts'o
reassign 616317 base
thanks

This isn't a bug in e2fsprogs; e2fsprogs has absolutely nothing to do
with mounting the file system.

Debian simply doesn't support the mount options for the root file
system in /etc/fstab having any effect on how the root file system is
mounted.  The root file system is mounted by the kernel, and the mount
options used by the kernel are specified by the rootflags= option on
the kernel's boot command line.

This is effectively a feature request, and I debated what was the best
way to deal with this bug.  I could close it, and say, not a bug,
since Debian has never worked this way, and I suspect it was
deliberate.

Or, I could assign it to initramfs-tools, since what some other
distributions do is look in /etc/fstab, parse out the mount options
for the root file system, and then insert the appropriate root mount
options into the initrd image.  The problem with this is, (a) it's a
bit of a hack, (b) it only takes effect the next time you install a
new kernel, or if you deliberately and explicitly run mkinitramfs,
which has fairly baroque options that most users would never figure
out, and (c) not all Debian installations use an initrd, so whether or
not it works would depend on how the boot sequence was set up.  If you
don't use an initrd, you'd have to edit it into grub's configuration
file.  But then, not all Debian systems use grub as their boot loader.

Neither of these seemed obviously the right choice.

So I'm going to do the cowardly thing, and choose the third option,
which is to reassign this back to base, cc'ing debian-devel.  I'm not
sure what the right thing is to do here, since honoring this feature
request would require making changes to multiple different packages:
initramfs-tools, all of the bootloaders, etc.

Should we try to make this work (badly at best, since a change in
mount options in /etc/fstab would only take effect at the next
mkinitramfs and/or update-grub invocation)?  Or should we just close
out this bug and say, tough luck, kid; if you want to change the root
file system's mount options, you need to edit your kernel's boot
options using whatever bootloader you might happen to be using?

I have a slight preference for the latter, since it avoids a lot of
complexity that wouldn't really work right anyway, but let's see what
other people think.

Regards,

- Ted





Re: e2fsprogs as Essential: yes?

2011-03-26 Thread Ted Ts'o
On Sat, Mar 26, 2011 at 10:42:09PM +, Mark Hymers wrote:
 
 The only other thing I can see is that e2fsprogs contains lsattr and
 chattr - a quick grep through my local /var/lib/dpkg/info shows that
 chattr is used in the postfix postinst without an explicit dependency.
 I wonder if there are more instances of that?

What to do with lsattr and chattr is actually a somewhat interesting
question.  They are most definitely ext2/3/4 specific, and the
ext2/3/4 development team still adds new flags from time to time, so
we have no plans to move those two commands out of e2fsprogs any time
soon.  On the other hand, other file systems, including reiserfs and
btrfs, have used the same ioctl and command line interface, and we do
coordinate flags to prevent conflicts.  So users of those other file
systems will still need lsattr and chattr.
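
For the curious, the shared interface in question is the
FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctl pair from <linux/fs.h>; a rough
sketch of what lsattr and chattr boil down to (error handling and
option parsing trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    int fd, flags;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;
    /* This is essentially what lsattr does for each file. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        perror("FS_IOC_GETFLAGS");
        return 1;
    }
    printf("flags = %#x\n", flags);

    /* And this is chattr +a; setting the append-only or immutable
     * flags requires CAP_LINUX_IMMUTABLE. */
    flags |= FS_APPEND_FL;
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
        perror("FS_IOC_SETFLAGS");
    close(fd);
    return 0;
}

Since reiserfs, btrfs, etc. accept the same ioctl numbers, this is
also why the flag values have to be coordinated across file systems.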

There are similar, although less serious, issues with filefrag -v,
which will work on other file systems but which also has some
ext2/3/4-specific code in it.

Another binary which is used by other packages is the logsave
utility, which is also in e2fsprogs, and which is used by
/etc/init.d/checkfs.sh and /etc/init.d/checkroot.sh in the initscripts
package.

Regards,

- Ted





Re: Safe file update library ready (sort of)

2011-01-29 Thread Ted Ts'o
On Fri, Jan 28, 2011 at 07:37:02AM +0100, Olaf van der Spek wrote:
 
 Is there a way to log cases where (potentially) unsafe writes happen?
 Cases like truncation of an existing file, rename on a target that's
 not yet synced, etc.

Not really, because there are plenty of cases where it's perfectly OK
not to sync a file on close or on rename.  Any files created during a
build, for example, can easily be reproduced in the unlikely case of a
system crash.  If you are untarring a source tree, it's fine not to
worry about syncing out the files, since presumably you can always
repeat the untar operation.  Or take git; when git is checking out
files into the working directory, there's no reason that has to be
done in a super-safe way.  On the other hand, when it is writing the
git object files and pack files, those had better be written safely.
At the end of the day the application programmer needs to understand
what is going on, and write his code appropriately based on the needs
of his application with respect to reliability after a crash or power
failure.

So how can you log warnings that a program has just done something
unsafe?  It's unsafe only if there's no other way to
reconstitute the data that was just written.  But that's not something
which is easily knowable.

(I know, I'm being *so* unfair; I'm expecting application programmers
to be competent...)

- Ted





Re: Safe file update library ready (sort of)

2011-01-27 Thread Ted Ts'o
On Wed, Jan 26, 2011 at 06:14:42PM +0100, Olaf van der Spek wrote:
 On Wed, Jan 26, 2011 at 5:36 PM, Hendrik Sattler
 p...@hendrik-sattler.de wrote:
  BTW: KDE4 is a very good example for failure with modern filesystems. I
  regularly lose configuration files when suspend-to-ram fails even if the
  configuration of the running programs were not changed. Yay :-( And this is
  with XFS, not Ext4! Filed a bug a looong time ago in KDE BTS. Reaction:
  none!
 
 Maybe complain to the Linux kernel people instead.

It won't be just XFS or ext4, but any file system except ext3 (which
has performance problems specifically *because* of the implementation
detail that accidentally provided this feature you like), and I think
what you'll find is that most Linux kernel developers will tell you
it's a bug in the application.

If you don't like that answer, you'll find that it's true for any
other OS (i.e., BSD, OpenSolaris, etc.)  --- so either KDE needs to
get with the program, or find its users gradually switching to other
windowing systems that have sanely written libraries.

- Ted

P.S.  There is a kernel option that provides improved ext3
performance, to wit, CONFIG_EXT3_DEFAULTS_TO_ORDERED=n, which also
means that you had better use fsync() if you want files pushed out to
disk.  So strictly speaking, it's not even true that KDE4 is
guaranteed to be safe if you use ext3.





Re: Safe File Update (atomic)

2011-01-05 Thread Ted Ts'o
On Wed, Jan 05, 2011 at 12:55:22PM +0100, Olaf van der Spek wrote:
  If you give me a specific approach, I can tell you why it won't work,
  or why it won't be accepted by the kernel maintainers (for example,
  because it involves pouring far too much complexity into the kernel).
 
 Let's consider the temp file workaround, since a lot of existing apps
 use it. A request is to commit the source data before committing the
 rename. Seems quite simple.

Currently ext4 is initiating writeback on the source file at the time
of the rename.  Given performance measurements others (maybe it was
you, I can't remember, and I don't feel like going through the
literally hundreds of messages on this and related threads) have
cited, it seems that btrfs is doing something similar.  The problem
with doing a full commit, which means surviving a power failure, is
that you have to request a barrier operation to make sure the data
goes all the way down to the disk platter --- and this is expensive
(on the order of at least 20-30ms, more if you've written a lot to the
disk).

We have had experience with forcing data writeback (what you call
"commit the source data") before the rename --- ext3 did that.  And
it had some very nasty performance problems which showed up on very
busy systems where people were doing a lot of different things at the
same time: large background writes from bittorrents and/or DVD
ripping, compiles, web browsing, etc.  If you force a large amount of
data out when you do a commit, everything else that tries to write to
the file system at that point stops, and if you have stupid programs
(i.e., firefox trying to do database updates on its UI loop), it can
cause programs to apparently lock up, and users get really upset.

So one of the questions is how much we should be penalizing programs
that are doing things right (i.e., using fsync), versus programs which
are doing things wrong (i.e., using rename and trusting to luck).
This is a policy question, on which you might have a different opinion
than I do.

We could also simply force a synchronous data writeback at rename
time, instead of merely starting writeback at the point of the rename.
In the case of a program which has already done an fsync(), the
synchronous data writeback would be a no-op, so that's good in terms
of not penalizing programs which do things right.  But the problem
there is that there could be some renames where forcing data writeback
is not needed, and so we would be forcing the performance hit of
"commit the source data" even when it might not be needed (or wanted)
by the user.

How often does it happen that someone does a rename on top of an
already-existing file, where the fsync() isn't wanted?  Well, I can
think up scenarios, such as where an existing .iso image is corrupted
or needs to be updated, and so the user creates a new one and then
renames it on top of the old .iso image, but then gets surprised when
the rename ends up taking minutes to complete.  Is that a common
occurrence?  Probably not, but the case of the system crashing right
after the rename() is somewhat unusual as well.

Humans in general suck at reasoning about low-probability events;
that's why we are allowing low-paid TSA workers to grope
air-travellers to avoid terrorists blowing up planes mid-flight, while
not being up in arms over the number of deaths every year due to
automobile accidents.

For this reason, I'm cautious about going overboard at forcing commits
on renames; doing this has real performance implications, and it is a
computer science truism that optimizing for the uncommon/failure case
is a bad thing to do.

OK, what about simply deferring the commit of the rename until the
file writeback has naturally completed?  The problem with that is
entangled updates.  Suppose there is another file which is written
to the same directory block as the one affected by the rename, and
*that* file is fsync()'ed?  Keeping track of all of the data
dependencies is **hard**.   See: http://lwn.net/Articles/339337/

  But for me to list all possible approaches and tell you why each one
  is not going to work?  You'll have to pay me before I'm willing to
  invest that kind of time.
 
 That's not what I asked.

Actually, it is, although maybe you didn't realize it.  Look above at
how I had to present multiple alternatives, and then shoot them all
down, one at a time.  There are hundreds of solutions, all of them
wrong.

Hence why *my* counter is --- submit patches.  The mere act of
actually trying to code an alternative will allow you to determine why
your approach won't work, or failing that, others can take your patch,
apply them, and then demonstrate use cases where your idea completely
falls apart.  But it means that you do most of the work, which is fair
since you're the one demanding the feature.

It doesn't scale for me to spend a huge amount of time composing
e-mails like this, which is why it's rare that I do that.  You've
tricked me into 

Re: Safe File Update (atomic)

2011-01-05 Thread Ted Ts'o
On Wed, Jan 05, 2011 at 09:38:30PM +0100, Olaf van der Spek wrote:
 
 Performance is important, I agree.
 But you're trading performance for safety here.

... but if the safety is not needed, then you're paying for no good
reason.  And if performance is needed, then use fsync().

  OK, what about simply deferring the commit of the rename until the
  file writeback has naturally completed?  The problem with that is
  entangled updates.  Suppose there is another file which is written
  to the same directory block as the one affected by the rename, and
  *that* file is fsync()'ed?  Keeping track of all of the data
  dependencies is **hard**.   See: http://lwn.net/Articles/339337/
 
 Ah. So performance isn't the problem, it's just hard too implement.
 Would've been a lot faster if you said that earlier.

"Too hard to implement" doesn't go far enough.  It also makes it
nearly impossible to add new features later.  BSD FFS didn't get
ACLs, extended attributes, and many other features until ***years***
after Linux had them.  Complexity is evil; it leads to bugs, makes
things hard to maintain, and it makes it harder to add new features
later.

But hey, if you're so smart, you go ahead and implement them yourself.
You can demonstrate how you can do it better than everyone else.
Otherwise you're just wasting everybody's time.  Complex ideas are not
valid ones; or at least they certainly aren't good ones.

   - Ted





Re: Safe File Update (atomic)

2011-01-05 Thread Ted Ts'o
On Wed, Jan 05, 2011 at 10:47:03PM +0100, Olaf van der Spek wrote:
 
 That was about soft updates. I'm not sure this is just as complex.

Then I invite you to implement it, and start discovering all of the
corner cases for yourself.  :-)  As I predicted, you're not going to
believe me when I tell you it's too hard.

 I was thinking, doesn't ext have this kind of dependency tracking already?
 It has to write the inode after writing the data, otherwise the inode
 might point to garbage.

No, it doesn't.  We use journaling, and forced data writeouts, to
ensure consistency.

- Ted





Re: Safe File Update (atomic)

2011-01-05 Thread Ted Ts'o
On Thu, Jan 06, 2011 at 12:57:07AM +, Ian Jackson wrote:
 Ted Ts'o writes (Re: Safe File Update (atomic)):
  Then I invite you to implement it, and start discovering all of the
  corner cases for yourself.  :-)  As I predicted, you're not going to
  believe me when I tell you it's too hard.
 
 How about you reimplement all of Unix userland, first, so that it
 doesn't have what you apparently think is a bug!

I think you are forgetting the open source way, which is you scratch
your own itch.

The main programs I use where I'd care about this (e.g., emacs) got
this right two decades ago; I even remember being around during the
MIT Project Athena days, almost 25 years ago, when we needed to add
error checking to the fsync() call because Transarc's AFS didn't
actually try to send the file you were saving to the file server until
the fsync() or the close() call, and so if you got an over-quota
error, it was reflected back at fsync() time, and not at the write()
system call, which was where emacs had been expecting and checking for
it.  (All of which is POSIX compliant, so the bug was clearly with
emacs; it was fixed, and we moved on.)
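
To make the point concrete, the fix amounts to checking the return
values of fsync() (and, for good measure, close()) instead of assuming
write() told the whole story.  A sketch, not the actual emacs code:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 only if the data was accepted all the way down; on AFS or
 * NFS an over-quota or server error may first show up here rather
 * than at write() time. */
static int finish_save(int fd, const char *path)
{
    int ret = 0;

    if (fsync(fd) < 0) {
        fprintf(stderr, "%s: fsync: %s\n", path, strerror(errno));
        ret = -1;
    }
    if (close(fd) < 0 && ret == 0) {
        fprintf(stderr, "%s: close: %s\n", path, strerror(errno));
        ret = -1;
    }
    return ret;    /* on failure, the caller keeps the old file */
}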

If there was a program that I used and where I'd care about it, I'd
scratch my own itch and fix it.  Olaf seems to be really concerned
about this theoretical use case, and if he cares so much, he can
either stick with ext3, which has the property he wants purely by
accident, but which has terrible performance problems under some
circumstances as a result, or he can fix it in the programs that he
cares about --- or he can try to create his own file system (and he
can either impress us if he actually can solve it without disastrous
performance problems, or be depressed when no one uses it because it
is dog slow).

Note that all of the modern file systems (and all of the historical
ones too, with the exception of ext3) have always had the same
property.  If you care about the data, you use fsync().  If you don't,
then you can take advantage of the fact that compiles are really,
really fast.  (After all, in the very unlikely case that you crash,
you can always rebuild, and why should you optimize for an unlikely
case?  And if you have crappy proprietary drivers that cause you to
crash all the time, then maybe you should rethink using said
proprietary drivers.)

That's the open source way --- you scratch your own itch.  I'm
perfectly satisfied with the open source tools that I use.  Unless you
think the programmers two decades ago were smarter, and people have
gotten dumber since then (Are we not men?  We are Devo!), it really
isn't that hard to follow the rules.

- Ted





Re: Safe File Update (atomic)

2011-01-04 Thread Ted Ts'o
On Wed, Jan 05, 2011 at 01:05:03AM +0100, Olaf van der Spek wrote:
 
 Why is it that you ignore all my responses to technical questions you asked?
 

In general, because they are either (a) not well-formed, or (b) you
are asking me to prove a negative.  Getting people to believe that you
can't square a circle[1] is very hard, and when I was one of the
postmasters at MIT, we'd get kooks every so often saying that they had
a proof that they could square the circle, but everyone was being
unfair and ignoring them, and could we please forward this to the head
of MIT's math department with their amazing discovery.  We learned a
long time ago that it's not worth trying to argue with kooks like
that.  It's like trying to teach a pig to sing.  It frustrates you,
and it annoys the pig.

[1] http://en.wikipedia.org/wiki/Squaring_the_circle

If you give me a specific approach, I can tell you why it won't work,
or why it won't be accepted by the kernel maintainers (for example,
because it involves pouring far too much complexity into the kernel).
But for me to list all possible approaches and tell you why each one
is not going to work?  You'll have to pay me before I'm willing to
invest that kind of time.

Best regards,

- Ted





Re: Safe File Update (atomic)

2011-01-03 Thread Ted Ts'o
On Mon, Jan 03, 2011 at 09:49:40AM -0200, Henrique de Moraes Holschuh wrote:
 
  1) You care about data loss in the case of power failure, but not in
  the case of hard drive or storage failure, *AND* you are writing tons
  and tons of tiny 3-4 byte files and so you are worried about
  performance because you're doing something insane with large number of
  small files.
 
 That usage pattern cannot be made both safe and fast outside of a full-blown
 ACID database, so lets skip it.

Agreed.

 
  2) You are specifically worried about the case where you are replacing
  the contents of a file that is owned by different uid than the user
  doing the data file update, in a safe way where you don't want a
  partially written file to replace the old, complete file, *AND* you
  care about the file's ownership after the data update.
 
 I am not sure about the file ownership, but this is the useful usecase IMO.

But if you don't care about file ownership, then you can do the
"write a temp file, fsync, and rename" trick.  If it's about ease of
use, as you suggest, a userspace library solves that problem.  It's
*only* if you care about the file ownership remaining the same that
(2) comes into play.

  3) You care about the temp file used by the userspace library, or
  application which is doing the write temp file, fsync(), rename()
  scheme, being automatically deleted in case of a system crash or a
  process getting sent an uncatchable signal and getting terminated.
 
 This is always useful, as well.

 and (3) is the recovery after a power failure/crash scenario

If you don't care about the file ownership issue, then recovering
after a power failure/crash is the last remaining case --- and you
could solve this by creating a file with an mktemp-style name in a
mode 1777 directory, where the contents of the file contain the temp
file name to be deleted by an init.d script.  This could be done in
the userspace library, and if you crash after the rename, but before
you have a chance to delete the file containing the
temp-filename-to-be-deleted, that's not a problem, since the init.d
script will find no file with that name to be deleted, and then
continue.
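
A sketch of what that could look like inside the library (the
/var/lib/safe-write directory name and the record format are made up
purely for illustration):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Record the temp file's path in the (hypothetical) mode-1777 cleanup
 * directory before the rename; an init.d script unlinks whatever
 * paths it finds recorded here.  Returns the record's path so the
 * caller can unlink it once the rename has succeeded. */
static char *record_temp_file(const char *tmppath)
{
    static char rec[] = "/var/lib/safe-write/cleanup.XXXXXX";
    int fd = mkstemp(rec);

    if (fd < 0)
        return NULL;
    write(fd, tmppath, strlen(tmppath));
    write(fd, "\n", 1);
    fsync(fd);
    close(fd);
    return rec;
}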

Hence, all of these problems can be solved in userspace, with a
userspace library, with the exception of the file ownership issue,
which you've admitted may not be all that critical.

  Is it worth it?  I'd say no; and suggest that someone who really cares
  should create a userspace application helper library first, since
  you'll need it as a fallback for the cases listed above where this
  scheme won't work.  (Even if you do the fallback in the kernel, you'll
  still need userspace fallback for non-Linux systems, and for when the
  application is run on an older Linux kernel that doesn't have all of
  this O_ATOMIC or link/unlink magic).
 
 That's what I suggested, as well.

Then we're in agreement.  :-)

- Ted





Re: Safe File Update (atomic)

2011-01-03 Thread Ted Ts'o
On Mon, Jan 03, 2011 at 12:26:29PM +0100, Olaf van der Spek wrote:
 
 Given that the issue has come up before so often, I expected there to
 be a FAQ about it.

Your asking the question over (and over... and over...)  doesn't make
it an FAQ.  :-)

Aside from your asking over and over, it hasn't come up that often,
actually.  The right answer has been known for decades, and it is
very simple: write a temp file, copy over the xattrs and ACLs if you
care (in many cases, such as an application's private state files, it
won't care, so it can skip this step --- it's only the more generic
file editors that would need to worry about such things --- but when's
the last time anyone really worried about xattrs on a .c file?),
fsync(), and rename().

This is *not* hard.   People who get it wrong are just being lazy.
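
For the record, the recipe is short enough to write out.  A sketch
(the xattr/ACL copying, a matching fchmod(), and a final fsync() of
the containing directory are left out for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int safe_write(const char *target, const void *buf, size_t len)
{
    char tmp[4096];
    int fd;

    snprintf(tmp, sizeof(tmp), "%s.XXXXXX", target);
    fd = mkstemp(tmp);              /* temp file in the same directory */
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len ||  /* short writes not retried here */
        fsync(fd) < 0) {                         /* data must be stable first */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0 || rename(tmp, target) < 0) {  /* rename() is the atomic commit */
        unlink(tmp);
        return -1;
    }
    return 0;
}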

In the special case of dpkg, where they are writing a moderate number
of large files, and they care about syncing the files without causing
journal commits, the use of sync_file_range() on the files followed by
a series of fdatasync() calls has solved their issues as far as I know.
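
That pattern is basically a two-pass flush --- start writeback on
everything first, then wait --- along these lines (a sketch;
sync_file_range() is Linux-specific):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Pass 1 starts asynchronous writeback on every extracted file;
 * pass 2 waits for the data to reach the disk.  Flushing this way
 * lets the writeback of all the files overlap, rather than syncing
 * each file (and forcing the journal) one at a time. */
static void flush_extracted_files(int *fds, int nfds)
{
    int i;

    for (i = 0; i < nfds; i++)
        sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);
    for (i = 0; i < nfds; i++)
        fdatasync(fds[i]);
}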

- Ted





Re: Safe File Update (atomic)

2011-01-02 Thread Ted Ts'o
On Sun, Jan 02, 2011 at 04:14:15PM +0100, Olaf van der Spek wrote:
 
 Last time you ignored my response, but let's try again.
 The implementation would be comparable to using a temp file, so
 there's no need to keep 2 g in memory.
 Write the 2 g to disk, wait one day, append the 1 k, fsync, update inode.

Write the 2G to disk *where*?  Some randomly assigned blocks?  And
using *what* to keep track of where to find all of the metadata
blocks?  That information is normally stored in the inode, but you
don't want to touch it.  So we need to store it someplace, and you
haven't specified where.  Some alternate universe?  Another inode,
which is only tied to that file descriptor?  That's *possible*, but
it's (a) not at all trivial, and (b) won't work for all file systems.
It definitely won't work for FAT-based file systems, so your blithe
"oh, just emulate it in the kernel" is rather laughable.

If you think it's so easy, *you* go implement it.

  How exactly do the semantics for O_ATOMIC work?
 
  And given at the momment ***zero*** file systems implement O_ATOMIC,
  what should an application do as a fallback?  And given that it is
 
 Fallback could be implement in the kernel or in userland. Using rename
 as a fallback sounds reasonable. Implementations could switch to
 O_ATOMIC when available.

Using rename as a fallback means exposing random temp file names in
the directory, which could conflict with files that userspace might
want to create.  It could be done, but again, it's an awful lot of
complexity to shove into the kernel.

  highly unlikely this could ever be implemented for various file
  systems including NFS, I'll observe this won't really reduce
  application complexity, since you'll always need to have a fallback
  for file systems and kernels that don't support O_ATOMIC.
 
 I don't see a reason why this couldn't be implemented by NFS.

Try it; it should become obvious fairly quickly.  Or just go read the
NFS protocol specifications.

 As you've said yourself, a lot of apps don't get this right. Why not?
 Because the safe way is much more complex than the unsafe way. APIs
 should be easy to use right and hard to misuse. With O_ATOMIC, I feel
 this is the case. Without, it's the opposite and the consequences are
 obvious. There shouldn't be a tradeoff between safety and potential
 problems.

Application programmers have in the past been unwilling to change
their applications.  If they are willing to change their applications,
they can just as easily use a userspace library, or use fsync() and
rename() properly.  If they aren't willing to change their programs
and recompile (and the last time we've been around this block, they
weren't; they just blamed the file system), asking them to use
O_ATOMIC probably won't work, given the portability issues.

  And of course, Olaf isn't actually offering to implement this
  hypothetical O_ATOMIC.  Oh, no!  He's just petulantly demanding it,
  even though he can't give us any concrete use cases where this would
  actually be a huge win over a userspace safe-write library that
  properly uses fsync() and rename().
 
 Not true. I've asked (you) for just such a lib, but I'm still waiting
 for an answer.

Pay someone enough money, and they'll write you the library.  Whining
about it petulantly and expecting someone else to write it is probably
not going to work.

Quite frankly, if you're competent enough to use it, you should be
able to write such a library yourself.  If you aren't going to be
using it yourself, then why are you wasting everyone's time on this?

- Ted





Re: Safe File Update (atomic)

2011-01-02 Thread Ted Ts'o
On Sun, Jan 02, 2011 at 03:14:41PM -0200, Henrique de Moraes Holschuh wrote:
 
 1. Create unlinked file fd (benefits from kernel support, but doesn't
 require it).  If a filesystem cannot support this or the boundary conditions
 are unaceptable, fail.  Needs to know the destination name to do the unliked
 create on the right fs and directory (otherwise attempts to link the file
 later would have to fail if the fs is different).

This is possible.  It would be specific only to file systems that
support inodes (i.e., ix-nay for NFS, FAT, etc.).  Some file systems
would want to know a likely directory where the file would be linked,
so that their inode and block allocation policies can optimize the
inode and block placement.

 2. fd works as any normal fd to an unlinked regular file.
 
 3. create a link() that can do unlink+link atomically.  Maybe this already
 exists, otherwise needs kernel support.
 
 The behaviour of (3) should allow synchrous wait of a fsync() and a sync of
 the metadata of the parent dir.  It doesn't matter much if it does
 everything, or just calling fsync(), or creating a fclose() variant that
 does it.

OK, so this is where things get tricky.  The first issue is that you
are asking for the ability to take a file descriptor and link it into
some directory.  The inode associated with the fd might or might not
be already linked to some other directory, and it might or might not
be owned by the user trying to do the link.  The latter could get
problematic if quota is enabled, since it does open up a new potential
security exposure.

A user might pass a file descriptor to another process in a different
security domain, and that process could create a link to some
directory which the original user doesn't have access to.  The user
would no longer be able to delete the file and drop quota, and the process
would retain permanent access to the file, which it might not
otherwise have if the inode was protected by a parent directory's
permissions.  It's for the same reason that we can't just implement
open-by-inode-number; even if you use the inode's permissions and
ACL's to do a access check, this allows someone to bypass security
controls based on the containing directory's permissions.  It might
not be a security exposure, but for some scenarios (i.e., a mode 600
~/Private directory that contains world-readable files), it changes
accessibility of some files.

We could control for this by only allowing the link to happen if the
user executing this new system call owns the inode being linked, so
this particular problem is addressable.

The larger problem is this doesn't give you any performance benefits
over simply creating a temporary file, fsync'ing it, and then doing
the rename.  And it doesn't solve the problem that userspace is
responsible for copying over the extended attributes and ACL
information.  So in exchange for doing something non-portable which is
Linux specific, and won't work on FAT, NFS, and other non-inode based
file systems at all, and which requires special file-system
modifications for inode-based file systems --- the only real benefit
you get is that the temp file gets cleaned up automatically if you
crash before the new magical link/unlink system call is completed.

Is it worth it?   I'm not at all convinced.

Can this be fixed?  Well, I suppose we could have this magical
link/unlink system call also magically copy over the xattr and acl's.

And if you don't care about when things happen, you could have the
kernel fork off a kernel thread, which does the fsync, followed by the
magic ACL and xattr copying, and once all of this completes, it could
do the magic link/unlink.

So we could bundle all of this into a system call.  *Theoretically*.
But then someone else will say that they want to know when this magic
link/unlink system call actually completes.  Others might say that
they don't care about the fsync happening right away, but would rather
wait some arbitrary time and let the system writeback algorithms write
back the file *whenever* --- and only when the file has been written
back should the rest of the magical link/unlink happen.

So now we have an explosion of complexity, with all sorts of
different variants.  And there's also the problem that if you don't
make the system call synchronous (where it does an fsync() and waits
for it to complete), you'll lose the ability to report errors back to
userspace.

Which gets me back to the question of use cases.  When are we going
to be using this monster?  For many use cases, the original reason why
we said people were doing it wrong was the risk of losing data.  But
if you don't do things synchronously (i.e., with fsync()), you'll also
end up risking losing data because you won't know about write failures
--- specifically, your program may have long exited by the time the
write failure is noticed by the kernel.  But if you make the system
call synchronous, now there's no

Re: Safe File Update (atomic)

2011-01-01 Thread Ted Ts'o
On Fri, Dec 31, 2010 at 09:51:50AM -0200, Henrique de Moraes Holschuh wrote:
 On Fri, 31 Dec 2010, Olaf van der Spek wrote:
  Ah, hehe. BTW, care to respond to the mail I send to you?
 
 There is nothing more I can add to this thread.  You want O_ATOMIC.  It
 cannot be implemented for all use cases of the POSIX API, so it will not
 be implemented by the kernel.  That's all there is to it, AFAIK.
 
 You could ask for a new (non-POSIX?) API that does not ask of a
 POSIX-like filesystem something it cannot provide (i.e. don't ask for
 something that requires inode-path reverse mappings).  You could ask
 for syscalls to copy inodes, etc.  You could ask for whatever is needed
 to do a (open+write+close) that is atomic if the target already exists.
 Maybe one of those has a better chance than O_ATOMIC.

The O_ATOMIC open flag is highly problematic, and it's not fully
specified.  What if the system is under a huge amount of memory
pressure, and the badly behaved application program does:

fd = open(file, O_ATOMIC | O_TRUNC);
write(fd, buf, 2*1024*1024*1024); // write 2 gigs, heh, heh heh
sleep for one day
write(fd, buf2, 1024);
close(fd);

What happens if another program opens "file" for reading during the
one-day sleep period?  Does it get the old contents of "file"?  The
partially written, incomplete new version of "file"?  What happens if
the file is currently mmap'ed, as Henrique has asked?

What if another program opens the file O_ATOMIC during the one day
sleep period, so the file is in the middle of getting updated by two
different processes using O_ATOMIC?

How exactly do the semantics for O_ATOMIC work?

And given that at the moment ***zero*** file systems implement
O_ATOMIC, what should an application do as a fallback?  And given that
it is highly unlikely this could ever be implemented for various file
systems including NFS, I'll observe this won't really reduce
application complexity, since you'll always need to have a fallback
for file systems and kernels that don't support O_ATOMIC.

And what are the use cases where this really makes sense?  Will
people really code to this interface, knowing that it only works on
Linux (there are other operating systems out there, like FreeBSD and
Solaris and AIX, you know, and some application programmers _do_ care
about portability), and the only benefits are (a) a marginal
performance boost for insane people who like to write vast numbers of
2-4 byte files without any need for atomic updates across a large
number of these small files, and (b) the ability to keep the file
owner unchanged when someone other than the owner updates said file
(how important is this _really_; what is the use case where this
really matters?).

And of course, Olaf isn't actually offering to implement this
hypothetical O_ATOMIC.  Oh, no!  He's just petulantly demanding it,
even though he can't give us any concrete use cases where this would
actually be a huge win over a userspace safe-write library that
properly uses fsync() and rename().

If someone were to pay me a huge amount of money, and told me what was
the file size range where such a thing would be used, and what sort of
application would need it, and what kind of update frequency it should
be optimized for, and other semantic details about parallel O_ATOMIC
updates, what happens to users who are in the middle of reading the
file, what are the implications for quota, etc., it's certainly
something I can entertain.  But at the moment, it's a vague
specification (not even a solution) looking for a problem.

- Ted





Re: Bug#605009: serious performance regression with ext4

2010-11-29 Thread Ted Ts'o
On Mon, Nov 29, 2010 at 02:58:16PM +, Ian Jackson wrote:
 
 This is the standard way that ordinary files for which reliability was
 important have been updated on Unix for decades.  fsync is for files
 which need synchronisation with things external to the computer (or at
 least, external to the volume) - eg, email at final dot.

This is simply not true.  And I'm speaking as someone who has been
doing Unix/Linux kernel development since the BSD 4.3 days.  (Well,
BSD 4.3+Tahoe days, to be precise.)

fsync() has always been the only guarantee that files would be on
disk.  In fact the way BSD worked, there was no guarantee that
rename() would provide any kind of file synchronization primitive;
that's actually something new.  No, in the old days, if you really
cared about a file, you would fsync() it.  Period.  End of paragraph.

It was just that in those days, the main things people cared about
were either source/text files (so the editors of the day would do the
right thing) or e-mail (and not just for the final delivery; for all
MTA's).

The reason people got this wrong idea was that (a)
back then Unix machines tended to be more reliable, because they were
run by professionals in machine rooms, very often with UPS's.  Also,
(b) people weren't loading craptastic video drivers with buggy
proprietary kernel modules; they may have used proprietary drivers,
but kernels weren't changing all the time, and there was a lot more
careful testing of drivers before they were unloosed onto the world.

Finally, (c): as an accident of how ext3 provided protection against
old file blocks showing up in newly allocated files (something which
BSD 4.3 did __not__ protect against, by the way), it had the
behaviour that renaming over a file __usually__ (but not always)
provided atomic guarantees.

(c) was especially unfortunate, because it never applied to all Linux
file systems, just to ext3, and because the same behaviour was also
responsible for disastrous desktop performance when you had a large
number of streaming writes (i.e., bittorrent, video ripping/copying,
etc.) going on in the background combined with foreground GUI
applications that were fsync()-happy --- i.e., firefox.

Lots of users have complained about the desktop performance problem,
but the reality is we can't really solve that without also taking away
the magic that made (c) happen.  Whether you solve it by using
data=writeback and stick with ext3, or switch to ext4, or switch to
XFS, or switch to btrfs --- all of these will solve the desktop
performance problem, but they also leave you vulnerable to file loss
in the case of system crashes and applications that don't use
fsync()/fdatasync().

Hence the fact that all file system developers, whether they were
btrfs developers or XFS developers or ext4 developers, made the joke
at the file system developers summit two years ago, that what the
application programmers really wanted was O_PONY, with the magic pixie
dust.   Unfortunately:

http://www.linuxformat.com/files/nopony.jpg

- Ted


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20101129151812.gu2...@thunk.org



Re: Bug#605009: serious performance regression with ext4

2010-11-28 Thread Ted Ts'o
I did some experimenting, and I figured out what was going on.  You're
right, (c) doesn't quite work, because delayed allocation meant that
the writeout didn't take place until the fsync() for each file
happened.  I didn't see this at first; my apologies.

However, this *does* work:

extract(a);
sync_file_range(fd.a, 0, 0, SYNC_FILE_RANGE_WRITE); 
extract(b.dpkg-new);
sync_file_range(fd.b, 0, 0, SYNC_FILE_RANGE_WRITE); 
extract(c.dpkg-new);
sync_file_range(fd.c, 0, 0, SYNC_FILE_RANGE_WRITE); 

sync_file_range(fd.a, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE); 
sync_file_range(fd.b, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE); 
sync_file_range(fd.c, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE); 

fdatasync(a);
fdatasync(b.dpkg-new);
fdatasync(c.dpkg-new);

rename(b.dpkg-new, b);
rename(c.dpkg-new, c);

This assumes that files b and c existed beforehand, but a is a new file.

What's going on here?  sync_file_range() is a Linux-specific system
call that has been around for a while.  It allows a program to control
when writeback happens in a very low-level fashion.  The first set of
sync_file_range() system calls causes the system to start writing back
each file once it has finished being extracted.  It doesn't actually
wait for the write to finish; it just starts the writeback.

The second series of sync_file_range() calls, with the operation
SYNC_FILE_RANGE_WAIT_BEFORE, will block until the previously initiated
writeback has completed.  This basically ensures that the delayed
allocation has been resolved; that is, the data blocks have been
allocated and written, and the inode updated (in memory), but not
necessarily pushed out to disk.

The fdatasync() call will actually force the inode to disk.  In the
case of the ext4 file system, the first fdatasync() will actually push
all of the inodes to disk, and all of the subsequent fdatasync() calls
are in fact no-ops (assuming that files 'a', 'b', and 'c' are all on
the same file system).  What this means is that it keeps the number
of (heavyweight) jbd2 commits to a minimum.

This approach uses a Linux-specific system call --- sync_file_range ---
but the result should be faster performance across the board for all
file systems.  So I don't consider this an ext4-specific hack,
although it probably does make things faster for ext4 more than for
any other file system.
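
Spelled out in plain C for a single pre-existing file, the sequence
above looks roughly like this (a sketch only: error checking is
omitted, and the file names are placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void replace_one_file(int fd)	/* fd of "foo.dpkg-new", already written */
{
	/* start writeback, but don't wait for it */
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

	/* ... extract other files and start their writeback here ... */

	/* wait for the writeback we started; this resolves delayed allocation */
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE);

	/* push the inode out; on ext4 the first fdatasync() in the batch
	 * does the real work and the rest are nearly free */
	fdatasync(fd);

	/* finally, atomically put the new contents in place */
	rename("foo.dpkg-new", "foo");
}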

I've attached the program I used to test and prove this mechanism, as
well as the kernel tracepoint script I used to debug why (c) wasn't
working, which might be of interest to folks on debian-kernel.
Basically it's a demonstration of how cool ftrace is.  :-)

But using this program on a file system on a 5400rpm laptop drive,
running LVM and LUKS, I get:

mass-sync-tester -d:dpkg current: time:  0.83/ 0.01/ 0.00

versus

mass-sync-tester -n:dpkg fixed: time:  0.07/ 0.00/ 0.01

   - Ted

/*
 * Mass sync tester
 */

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <getopt.h>
#include <errno.h>
#include <string.h>

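/*
 * Write 16k of zeroes to "name"; optionally fsync() it, or just kick
 * off writeback with sync_file_range(SYNC_FILE_RANGE_WRITE).
 */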
void write_file(const char *name, int sync, int sync_range)
{
	int	fd, i, ret;
	char	buf[1024];

	fd = open(name, O_WRONLY|O_TRUNC|O_CREAT, 0666);
	if (fd < 0) {
		fprintf(stderr, "open(%s) in write_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	memset(buf, 0, sizeof(buf));
	for (i=0; i < 16; i++) {
		ret = write(fd, buf, sizeof(buf));
		if (ret < 0) {
			fprintf(stderr, "writing %s: %s\n",
				name, strerror(errno));
			exit(1);
		}
	}
	if (sync) {
		ret = fsync(fd);
		if (ret < 0) {
			fprintf(stderr, "fsyncing %s in write_file: %s\n",
				name, strerror(errno));
			exit(1);
		}
	}
	if (sync_range) {
		ret = sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
		if (ret < 0) {
			fprintf(stderr, "sync_file_range %s in write_file: %s\n",
				name, strerror(errno));
			exit(1);
		}
	}
	ret = close(fd);
	if (ret < 0) {
		fprintf(stderr, "closing %s in write_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
}

void rename_file(const char *src, const char *dest)
{
	int ret;

	ret = rename(src, dest);
	if (ret) {
		fprintf(stderr, "renaming %s to %s: %s\n", src, dest,
			strerror(errno));
		exit(1);
	}
}

void sync_file(const char *name)
{
	int	fd, i, ret;

	fd = open(name, O_RDONLY|O_NOATIME, 0666);
	if (fd < 0) {
		fprintf(stderr, "open(%s) in sync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	ret = fsync(fd);
	if (ret < 0) {
		fprintf(stderr, "fsyncing %s in sync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	ret = close(fd);
	if (ret < 0) {
		fprintf(stderr, "closing %s in sync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
}

void datasync_file(const char *name)
{
	int	fd, i, ret;

	fd = open(name, O_RDONLY|O_NOATIME, 0666);
	if (fd < 0) {
		fprintf(stderr, "open(%s) in datasync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	ret = fdatasync(fd);
	if (ret < 0) {
		fprintf(stderr, 

Re: Bug#605009: serious performance regression with ext4

2010-11-26 Thread Ted Ts'o
On Fri, Nov 26, 2010 at 03:53:27PM +0100, Raphael Hertzog wrote:
 Just to sum up what dpkg --unpack does in 1.15.8.6:
 1/ set the package status as half-installed/reinst-required
 2/ extract all the new files as *.dpkg-new
 3/ for all the unpacked files: fsync(foo.dpkg-new) followed by
rename(foo.dpkg-new, foo)

What are you doing?

1) Suppose a package contains files a, b, and c.  Which are you
doing?

a)  extract a.dpkg-new ; fsync(a.dpkg-new); rename(a.dpkg-new, a);
extract b.dpkg-new ; fsync(b.dpkg-new); rename(b.dpkg-new, b);
extract c.dpkg-new ; fsync(c.dpkg-new); rename(c.dpkg-new, c);

or

b)  extract a.dpkg-new ; fsync(a.dpkg-new);
extract b.dpkg-new ; fsync(b.dpkg-new);
extract c.dpkg-new ; fsync(c.dpkg-new);
rename(a.dpkg-new, a);
rename(b.dpkg-new, b);
rename(c.dpkg-new, c);

or

c)  extract(a.dpkg-new);
extract(b.dpkg-new);
extract(c.dpkg-new);
fsync(a.dpkg-new);
fsync(b.dpkg-new);
fsync(c.dpkg-new);
rename(a.dpkg-new, a);
rename(b.dpkg-new, b);
rename(c.dpkg-new, c);


(c) will perform the best for most file systems, including ext4.  As a
further optimization, if b and c do not exist, of course it
would be better to extract into b and c directly, and skip the
rename, i.e.:

d)  extract(a.dpkg-new);
extract(b); # assuming the file b does not yet exist
extract(c); # assuming the file c does not yet exist
fsync(a.dpkg-new);
fsync(b);
fsync(c);
rename(a.dpkg-new, a);

... and then set the package status as unpacked.

I am guessing you are doing (a) today --- am I right?  (c) or (d)
would be best.
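
To make the shape of (c) concrete, here is a sketch in C --- error
handling omitted, and extract_to() is a made-up stand-in for dpkg's
extraction code that returns an open fd for the file it wrote:

#include <stdio.h>
#include <unistd.h>

extern int extract_to(const char *path);   /* hypothetical: extract one file, return its fd */

void unpack_batch(const char *names[], int n)
{
	int fds[n], i;
	char tmp[4096];

	for (i = 0; i < n; i++) {		/* extract everything first */
		snprintf(tmp, sizeof(tmp), "%s.dpkg-new", names[i]);
		fds[i] = extract_to(tmp);
	}
	for (i = 0; i < n; i++)			/* then one batch of fsyncs */
		fsync(fds[i]);
	for (i = 0; i < n; i++) {		/* then one batch of renames */
		close(fds[i]);
		snprintf(tmp, sizeof(tmp), "%s.dpkg-new", names[i]);
		rename(tmp, names[i]);
	}
}

Batching the fsync()s is what lets the file system resolve all of the
allocations in one or two journal commits instead of one commit per
file.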

- Ted


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20101126215254.gj2...@thunk.org



Re: why are there /bin and /usr/bin...

2010-08-18 Thread Ted Ts'o
On Mon, Aug 16, 2010 at 09:01:42PM +0200, Bernhard R. Link wrote:
 * Perry E. Metzger pe...@piermont.com [100816 20:21]:
  The most reasonable argument against altering such things is that
  after decades, people are used to the whole /usr thing and the fight
  to change it isn't worthwhile. That I will agree with -- see the
  emotional reactions people get when you suggest their preferred layout
  is an onion.
 
 Accusing people of irrational behaviour almost always results in
 irrational behaviour. Either they were irrational already before or
 making false insulting accusations. So I should better not tell you
 that accusing people of irrational behaviour is quite irrational...

There is a rational reason for doing this at least for servers.
Having a small root partition can be a huge advantage because it
minimizes the chances that it will get corrupted.  Making the root
read-only is even better from that perspective (but generally requires
more work).  What I like to do for servers that absolutely, positively
can't go down, and for which I don't have redundant servers (mainly
because I'm too poor :-), is to have a root partition which is small
(say, half a gig) and then mirror it onto another partition on a
separate spindle, and set up grub so I can boot off of either root
partition.  (If the BIOS has a way for me to specify, via a serial
console, booting off of the 2nd hard drive, even better; then I can
have a duplicate grub setup on the 2nd hard drive as well.)
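
For what it's worth, with GRUB legacy the dual-root setup boils down
to two menu entries pointing at the two copies; something along these
lines (device names, kernel version, and the serial console settings
are examples only, not a recommendation):

# /boot/grub/menu.lst -- sketch; adjust devices and kernel to taste
title  Debian, primary root (/dev/sda1)
root   (hd0,0)
kernel /boot/vmlinuz-2.6.32-5-amd64 root=/dev/sda1 ro console=ttyS0,115200
initrd /boot/initrd.img-2.6.32-5-amd64

title  Debian, mirrored root (/dev/sdb1)
root   (hd1,0)
kernel /boot/vmlinuz-2.6.32-5-amd64 root=/dev/sdb1 ro console=ttyS0,115200
initrd /boot/initrd.img-2.6.32-5-amd64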

I used to do this for desktops as well, but these days, a rescue CD is
easy enough to use.

- Ted


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20100819005027.ga16...@thunk.org