Re: [PATCH] btrfs: Mem leak in btrfs_get_acl()

2011-01-07 Thread Aneesh Kumar K. V
On Thu, 6 Jan 2011 22:45:21 +0100 (CET), Jesper Juhl j...@chaosbits.net wrote:
 
 It seems to me that we leak the memory allocated to 'value' in 
 btrfs_get_acl() if the call to posix_acl_from_xattr() fails.
 Here's a patch that attempts to correct that problem.
 
 Signed-off-by: Jesper Juhl j...@chaosbits.net

I posted a similar patch a long time back, but it never got picked up:

http://article.gmane.org/gmane.comp.file-systems.btrfs/6164

Message-id:1279547924-25141-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com

 ---
  acl.c |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)
 
   compile tested only.
 
 diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
 index d16..6d1410e 100644
 --- a/fs/btrfs/acl.c
 +++ b/fs/btrfs/acl.c
 @@ -60,8 +60,10 @@ static struct posix_acl *btrfs_get_acl(struct inode *inode, int type)
   size = __btrfs_getxattr(inode, name, value, size);
   if (size > 0) {
   acl = posix_acl_from_xattr(value, size);
 - if (IS_ERR(acl))
 + if (IS_ERR(acl)) {
 + kfree(value);
   return acl;
 + }
   set_cached_acl(inode, type, acl);
   }
   kfree(value);
 
 


-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Bug report: parent transid failed after heavy load

2011-01-07 Thread Arie Peterson

Dear all,


During a move of some 60GB of data from an ext4 partition to a btrfs 
partition, both on the same disk, the following happened:


- my window manager froze;
- the move suspended, i.e., no more data was written to the destination 
or deleted from the source;
- part of top output: (sorry for possible wrapping; sending this from 
webmail)


  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
13610 root  20   0    0    0    0 D   26  0.0 299:45.87 btrfs-cache-279
 1566 root  20   0    0    0    0 S   22  0.0 218:30.86 btrfs-endio-met


(there were also a firefox instance and an npviewer.bin process still 
consuming CPU time, but I guess this was some flash movie happily 
playing along during the freeze);


- those two btrfs processes could not be terminated, not even by kill -9;

- lsof 1566 output (similar for the other process):

COMMAND    PID USER  FD   TYPE    DEVICE SIZE/OFF NODE NAME
btrfs-end 1566 root  cwd  DIR     8,6        4096    2 /
btrfs-end 1566 root  rtd  DIR     8,6        4096    2 /
btrfs-end 1566 root  txt  unknown                       /proc/1566/exe

After a reboot, I am able to mount the btrfs filesystem, and read data 
from it, but as soon as I try any write operation (even a simple touch), 
that command hangs, and there are two btrfs processes hanging around, 
just as above; dmesg gives lots of "parent transid failed" messages.


My kernel is 2.6.36 (with gentoo patches).

So, the questions:

1) Is this a known problem? If so, is it fixed in a newer version?

In the archive of this list, I read about others with "parent transid
failed" errors, and a recovery procedure (suggested by Chris Mason)
using btrfs-select-super:
http://www.spinics.net/lists/linux-btrfs/msg07572.html.


2) Should I try this procedure to fix my filesystem? Is there any debug 
information I should collect first? (I can recreate the two spinning 
processes by rebooting and writing to the filesystem.)



I am saddened by this failure, as this data move was actually part of 
an operation to switch over to btrfs completely, after using it without 
problems for quite a while.



Thanks for any help. Keep up the good work.


Regards,

Arie Peterson



Re: Atomic file data replace API

2011-01-07 Thread Mike Fleetwood
On 6 January 2011 20:01, Olaf van der Spek olafvds...@gmail.com wrote:
 Hi,

 Does btrfs support atomic file data replaces?

Hi Olaf,

Yes, btrfs does support atomic replace, since kernel 2.6.30 (circa June 2009). [1]

Special handling was added to ext3, ext4, btrfs (and probably other
Linux FSs) for your replace-via-truncate and the alternative
replace-via-rename application patterns.  Try reading the "Delayed
allocation and the zero-length file problem" article and comments by
Ted Ts'o for further discussion. [2]

Mike
-- 
[1] 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5a3f23d515a2ebf0c750db80579ca57b28cbce6d
[2] 
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/


Re: Atomic file data replace API

2011-01-07 Thread Olaf van der Spek
On Fri, Jan 7, 2011 at 2:55 PM, Mike Fleetwood
mike.fleetw...@googlemail.com wrote:
 On 6 January 2011 20:01, Olaf van der Spek olafvds...@gmail.com wrote:
 Hi,

 Does btrfs support atomic file data replaces?

 Hi Olaf,

 Yes btrfs does support atomic replace, since kernel 2.6.30 circa June 2009. 
 [1]

 Special handling was added to ext3, ext4, btrfs (and probably other
 Linux FSs) for your replace-via-truncate and the alternative
 replace-via-rename application patterns.  Try reading Delayed
 allocation and the zero-length file problem article and comments by
 Ted Ts'o for further discussion. [2]

According to Ted, replace-via-truncate and replace-via-rename are
unsafe; only write-to-temp-file, fsync, rename is safe.
A disadvantage of the rename approach is that it resets the file owner
(when not running as root) and has issues with preserving metadata and
other attributes.
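The safe sequence referred to here (write the new contents to a temporary file, fsync it, then rename it over the target) can be sketched as below. This is an illustrative sketch, not code from the thread; the `replace_file` helper and the `.tmp` suffix are invented, and error handling is minimal.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the fsync+rename replace pattern. */
static int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    /* rename() atomically replaces path: after a crash you see either
     * the old or the new contents, never an empty or partial file */
    return rename(tmp, path);
}
```

Note that fully durable metadata would also need an fsync of the containing directory after the rename; that step is omitted here for brevity.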

My proposal was for an open flag, O_ATOMIC, to be introduced to tell
the FS the whole file update should be done atomically.
Ted says this is too hard in ext4, so I was wondering if this would be
possible in btrfs.

Olaf


Re: Atomic file data replace API

2011-01-07 Thread Olaf van der Spek
On Fri, Jan 7, 2011 at 3:01 PM, Olaf van der Spek olafvds...@gmail.com wrote:
 According to Ted, via-truncate and via-rename are unsafe. Only fsync,
 rename is safe.
 Disadvantage of rename is resetting file owner (if non-root), having
 issues with meta-data and other stuff.

 My proposal was for an open flag, O_ATOMIC, to be introduced to tell
 the FS the whole file update should be done atomically.
 Ted says this is too hard in ext4, so I was wondering if this would be
 possible in btrfs.

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2082
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2089
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2090


Re: [PATCH v2 1/5] add metadata_incore ioctl in vfs

2011-01-07 Thread Arnd Bergmann
On Thursday 06 January 2011, Shaohua Li wrote:
 Subject: add metadata_incore ioctl in vfs
 
 Add an ioctl to dump filesystem's metadata in memory in vfs. Userspace 
 collects
 such info and uses it to do metadata readahead.
 Filesystem can hook to super_operations.metadata_incore to get metadata in
 specific approach. Next patch will give an example how to implement
 .metadata_incore in btrfs.
 
 Signed-off-by: Shaohua Li shaohua...@intel.com

Looks great!

Reviewed-by: Arnd Bergmann a...@arndb.de


Re: [PATCH v2 3/5]add metadata_readahead ioctl in vfs

2011-01-07 Thread Arnd Bergmann
On Thursday 06 January 2011, Shaohua Li wrote:
 Subject: add metadata_readahead ioctl in vfs
 
 Add metadata readahead ioctl in vfs. Filesystem can hook to
 super_operations.metadata_readahead to handle filesystem specific task.
 Next patch will give an example how btrfs implements it.
 
 Signed-off-by: Shaohua Li shaohua...@intel.com

Reviewed-by: Arnd Bergmann a...@arndb.de


Re: Atomic file data replace API

2011-01-07 Thread Chris Mason
Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
 Hi,
 
 Does btrfs support atomic file data replaces? Basically, the atomic
 variant of this:
 // old stage
 open(O_TRUNC)
 write() // 0+ times
 close()
 // new state

Yes and no.  We have a best-effort mechanism where we guess that, since
you've done this truncate and the write, you want the writes to show up
quickly.  But it's a guess.

The problem is the "write() // 0+ times" step.  The kernel has no idea
what new result you want the file to contain, because the application
isn't telling us.

What btrfs can do (but we haven't yet implemented) is make sure that the
results of a single write call are on disk atomically, even if they are
replacing existing bytes in the file.

Because we cow and because we don't update metadata pointers until the
IO is complete, we can wait until all the IO for a given write call is
on disk before we update any of the metadata.

This isn't hard, it's on my TODO list.

-chris


Re: Atomic file data replace API

2011-01-07 Thread Olaf van der Spek
On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason chris.ma...@oracle.com wrote:
 Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
 Hi,

 Does btrfs support atomic file data replaces? Basically, the atomic
 variant of this:
 // old stage
 open(O_TRUNC)
 write() // 0+ times
 close()
 // new state

 Yes and no.  We have a best effort mechanism where we try to guess that
 since you've done this truncate and the write that you want the writes
 to show up quickly.  But its a guess.

 The problem is the write() // 0+ times.  The kernel has no idea what
 new result you want the file to contain because the application isn't
 telling us.

Isn't it safe for the kernel to wait until the first write or close
before writing anything to disk?

 What btrfs can do (but we haven't yet implemented) is make sure that the
 results of a single write file are on disk atomically, even if they are
 replacing existing bytes in the file.

 Because we cow and because we don't update metadata pointers until the
 IO is complete, we can wait until all the IO for a given write call is
 on disk before we update any of the metadata.

 This isn't hard, it's on my TODO list.

What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel?

Olaf


Re: Atomic file data replace API

2011-01-07 Thread Chris Mason
Excerpts from Olaf van der Spek's message of 2011-01-07 10:01:59 -0500:
 On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason chris.ma...@oracle.com wrote:
  Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
  Hi,
 
  Does btrfs support atomic file data replaces? Basically, the atomic
  variant of this:
  // old stage
  open(O_TRUNC)
  write() // 0+ times
  close()
  // new state
 
  Yes and no.  We have a best effort mechanism where we try to guess that
  since you've done this truncate and the write that you want the writes
  to show up quickly.  But its a guess.
 
  The problem is the write() // 0+ times.  The kernel has no idea what
  new result you want the file to contain because the application isn't
  telling us.
 
 Isn't it safe for the kernel to wait until the first write or close
 before writing anything to disk?

I'm afraid not.  Picture an application that opens a thousand files and
writes 1MB to each of them, and then didn't close any.  If we waited
until close, you'd have 1GB of memory pinned or staged somehow.

 
  What btrfs can do (but we haven't yet implemented) is make sure that the
  results of a single write file are on disk atomically, even if they are
  replacing existing bytes in the file.
 
  Because we cow and because we don't update metadata pointers until the
  IO is complete, we can wait until all the IO for a given write call is
  on disk before we update any of the metadata.
 
  This isn't hard, it's on my TODO list.
 
 What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel?

We can't guess beyond a single write call.  Otherwise we get into
the problem above where an application can force the kernel to wait
forever.  I'm not against O_ATOMIC to enable the new btrfs
functionality, but it will still be limited to one write.

-chris


Re: Atomic file data replace API

2011-01-07 Thread Olaf van der Spek
On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason chris.ma...@oracle.com wrote:
  The problem is the write() // 0+ times.  The kernel has no idea what
  new result you want the file to contain because the application isn't
  telling us.

 Isn't it safe for the kernel to wait until the first write or close
 before writing anything to disk?

 I'm afraid not.  Picture an application that opens a thousand files and
 writes 1MB to each of them, and then didn't close any.  If we waited
 until close, you'd have 1GB of memory pinned or staged somehow.

That's not what I asked. ;)
I asked to wait until the first write (or close). That way, you don't
get unintentional empty files.
One step further, you don't have to keep the data in memory, you're
free to write them to disk. You just wouldn't update the meta-data
(yet).

  This isn't hard, it's on my TODO list.

 What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel?

 We can't guess beyond a single write call.  Otherwise we get into
 the problem above where an application can force the kernel to wait
 forever.  I'm not against O_ATOMIC to enable the new btrfs
 functionality, but it will still be limited to one write.

 -chris




-- 
Olaf


Re: Atomic file data replace API

2011-01-07 Thread Chris Mason
Excerpts from Olaf van der Spek's message of 2011-01-07 10:08:24 -0500:
 On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason chris.ma...@oracle.com wrote:
   The problem is the write() // 0+ times.  The kernel has no idea what
   new result you want the file to contain because the application isn't
   telling us.
 
  Isn't it safe for the kernel to wait until the first write or close
  before writing anything to disk?
 
  I'm afraid not.  Picture an application that opens a thousand files and
  writes 1MB to each of them, and then didn't close any.  If we waited
  until close, you'd have 1GB of memory pinned or staged somehow.
 
 That's not what I asked. ;)
 I asked to wait until the first write (or close). That way, you don't
 get unintentional empty files.
 One step further, you don't have to keep the data in memory, you're
 free to write them to disk. You just wouldn't update the meta-data
 (yet).

Sorry ;) Picture an application that truncates 1024 files without closing any
of them.  Basically any operation that includes the kernel waiting for
applications because they promise to do something soon is a denial of
service attack, or a really easy way to run out of memory on the box.

-chris


Re: Atomic file data replace API

2011-01-07 Thread Olaf van der Spek
On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason chris.ma...@oracle.com wrote:
 That's not what I asked. ;)
 I asked to wait until the first write (or close). That way, you don't
 get unintentional empty files.
 One step further, you don't have to keep the data in memory, you're
 free to write them to disk. You just wouldn't update the meta-data
 (yet).

 Sorry ;) Picture an application that truncates 1024 files without closing any
 of them.  Basically any operation that includes the kernel waiting for
 applications because they promise to do something soon is a denial of
 service attack, or a really easy way to run out of memory on the box.

I'm not sure why you would run out of memory in that case.

O_ATOMIC would be the solution for the rename workaround (write temp
file, rename), with advantages like a much simpler API, no issues with
resetting metadata, no temp-file issues, and maybe better performance.

Olaf


Re: Atomic file data replace API

2011-01-07 Thread Chris Mason
Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
 On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason chris.ma...@oracle.com wrote:
  That's not what I asked. ;)
  I asked to wait until the first write (or close). That way, you don't
  get unintentional empty files.
  One step further, you don't have to keep the data in memory, you're
  free to write them to disk. You just wouldn't update the meta-data
  (yet).
 
  Sorry ;) Picture an application that truncates 1024 files without closing 
  any
  of them.  Basically any operation that includes the kernel waiting for
  applications because they promise to do something soon is a denial of
  service attack, or a really easy way to run out of memory on the box.
 
 I'm not sure why you would run out of memory in that case.

Well, lets make sure I've got a good handle on the proposed interface:

1) fd = open(some_file, O_ATOMIC)
2) truncate(fd, 0)
3) write(fd, new data)

The semantics are that we promise not to let the truncate hit the disk
until the application does the write.

We have a few choices on how we do this:

1) Leave the disk untouched, but keep something in memory that says this
inode is really truncated

2) Record on disk that we've done our atomic truncate but it is still
pending.  We'd need some way to remove or invalidate this record after a
crash.

3) Go ahead and do the operation but don't allow the transaction to
commit until the write is done.

option #1: keep something in memory.  Well, any time we have a
requirement to pin something in memory until userland decides to do a
write, we risk oom.

option #2: disk format change.  Actually somewhat complex because if we
haven't crashed, we need to be able to read the inode in again without
invalidating the record but if we do crash, we have to invalidate the
record.  Not impossible, but not trivial.

option #3: Pin the whole transaction.  Depending on the FS this may be
impossible.  Certain operations require us to commit the transaction to
reclaim space, and we cannot allow userland to put that on hold without
deadlocking.

What most people don't realize about crash-safe filesystems is that
they don't have fine-grained transactions.  There is one single
transaction for all the operations done.  This is mostly because it is
less complex and much faster, but it also makes any 'pin the whole
transaction' type of system unusable.

-chris


Re: Synching a Backup Server

2011-01-07 Thread Hubert Kario
On Friday, January 07, 2011 00:07:37 Carl Cook wrote:
 On Thu 06 January 2011 14:26:30 Carl Cook wrote:
  According To Doyle...
 
 Er, Hoyle...
 
 I am trying to create a multi-device BTRFS system using two identical
 drives. I want them to be raid 0 for no redundancy, and a total of 4TB.
 But in the wiki it says nothing about using fdisk to set up the drive
 first.  It just basically says for me to: mkfs.btrfs -m raid0 /dev/sdc
 /dev/sdd

I'd suggest at least 
mkfs.btrfs -m raid1 -d raid0 /dev/sdc /dev/sdd
if you really want raid0

 
 Seems to me that for mdadm I had to set each drive as a raid member,
 assemble the array, then format.  Is this not the case with BTRFS?
 
 Also in the wiki it says After a reboot or reloading the btrfs module,
 you'll need to use btrfs device scan to discover all multi-device
 filesystems on the machine.  Is this not done automatically?  Do I have
 to set up some script to do this?
 

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl


Re: Atomic file data replace API

2011-01-07 Thread Olaf van der Spek
On Fri, Jan 7, 2011 at 5:12 PM, Chris Mason chris.ma...@oracle.com wrote:
 I'm not sure why you would run out of memory in that case.

 Well, lets make sure I've got a good handle on the proposed interface:

 1) fd = open(some_file, O_ATOMIC)

No, O_TRUNC should be used in open. Maybe it works with a separate truncate too.

 2) truncate(fd, 0)
 3) write(fd, new data)

 The semantics are that we promise not to let the truncate hit the disk
 until the application does the write.

 We have a few choices on how we do this:

 1) Leave the disk untouched, but keep something in memory that says this
 inode is really truncated

 2) Record on disk that we've done our atomic truncate but it is still
 pending.  We'd need some way to remove or invalidate this record after a
 crash.

 3) Go ahead and do the operation but don't allow the transaction to
 commit until the write is done.

 option #1: keep something in memory.  Well, any time we have a
 requirement to pin something in memory until userland decides to do a
 write, we risk oom.

Since the file is open, you have to keep something in memory anyway,
right? Adding a bit (or bool) does not make a difference IMO.
Isn't this comparable to opening a temp file?

 option #2: disk format change.  Actually somewhat complex because if we
 haven't crashed, we need to be able to read the inode in again without
 invalidating the record but if we do crash, we have to invalidate the
 record.  Not impossible, but not trivial.

 option #3: Pin the whole transaction.  Depending on the FS this may be
 impossible.  Certain operations require us to commit the transaction to
 reclaim space, and we cannot allow userland to put that on hold without
 deadlocking.

#1 is the only one that makes sense.

 What most people don't realize about the crash safe filesystems is they
 don't have fine grained transactions.  There is one single transaction
 for all the operations done.  This is mostly because it is less complex
 and much faster, but it also makes any 'pin the whole transaction' type
 system unusable.

AFAIK the cost is mostly more complex code and runtime overhead, not
disk performance.

-- 
Olaf


Re: Synching a Backup Server

2011-01-07 Thread Hubert Kario
On Thursday, January 06, 2011 22:52:25 Freddie Cash wrote:
 On Thu, Jan 6, 2011 at 1:42 PM, Carl Cook cac...@quantum-sci.com wrote:
  On Thu 06 January 2011 11:16:49 Freddie Cash wrote:
   Also with this system, I'm concerned that if there is corruption on
   the HTPC, it could be propagated to the backup server.  Is there some
   way to address this?  Longer intervals to sync, so I have a chance to
   discover?
  
  Using snapshots on the backup server allows you to go back in time to
  recover files that may have been accidentally deleted, or to recover
  files that have been corrupted.
  
  How?  I can see that rsync will not transfer the files that have not
  changed, but I assume it transfers the changed ones.  How can you go
  back in time?  Is there like a snapshot file that records the state of
  all files there?
 
 I don't know the specifics of how it works in btrfs, but it should be
 similar to how ZFS does it.  The gist of it is:
 
 Each snapshot gives you a point-in-time view of the entire filesystem.
  Each snapshot can be mounted (ZFS is read-only; btrfs is read-only or
 read-write).  So, you mount the snapshot for 2010-12-15 onto /mnt,
 then cd to the directory you want (/mnt/htpc/home/fcash/videos/) and
 copy the file out that you want to restore (cp coolvid.avi ~/).
 
 With ZFS, things are nice and simple:
   - each filesystem has a .zfs/snapshot directory
   - in there are sub-directories, each named after the snapshot name
   - cd into the snapshot name, the OS auto-mounts the snapshot, and off you
 go
 
 Btrfs should be similar?  Don't know the specifics.
 
 How it works internally, is some of the magic and the beauty of
 Copy-on-Write filesystems.  :)

I usually create subvolumes in btrfs root volume:

/mnt/btrfs/
|- server-a
|- server-b
\- server-c

then create snapshots of these directories:

/mnt/btrfs/
|- server-a
|- server-b
|- server-c
|- snapshots-server-a
|   |- @GMT-2010.12.21-16.48.09
|   \- @GMT-2010.12.22-16.45.14
|- snapshots-server-b
\- snapshots-server-c

This way I can use the shadow_copy module for Samba to publish the
snapshots to Windows clients.
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl


Re: Atomic file data replace API

2011-01-07 Thread Hubert Kario
On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
 Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
  On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason chris.ma...@oracle.com 
wrote:
   That's not what I asked. ;)
   I asked to wait until the first write (or close). That way, you don't
   get unintentional empty files.
   One step further, you don't have to keep the data in memory, you're
   free to write them to disk. You just wouldn't update the meta-data
   (yet).
   
   Sorry ;) Picture an application that truncates 1024 files without
   closing any of them.  Basically any operation that includes the kernel
   waiting for applications because they promise to do something soon is
   a denial of service attack, or a really easy way to run out of memory
   on the box.
  
  I'm not sure why you would run out of memory in that case.
 
 Well, lets make sure I've got a good handle on the proposed interface:
 
 1) fd = open(some_file, O_ATOMIC)
 2) truncate(fd, 0)
 3) write(fd, new data)
 
 The semantics are that we promise not to let the truncate hit the disk
 until the application does the write.
 
 We have a few choices on how we do this:
 
 1) Leave the disk untouched, but keep something in memory that says this
 inode is really truncated
 
 2) Record on disk that we've done our atomic truncate but it is still
 pending.  We'd need some way to remove or invalidate this record after a
 crash.
 
 3) Go ahead and do the operation but don't allow the transaction to
 commit until the write is done.
 
 option #1: keep something in memory.  Well, any time we have a
 requirement to pin something in memory until userland decides to do a
 write, we risk oom.

Userland already has a file descriptor allocated (which can itself fail
because of OOM); I see no problem in increasing kernel memory usage by
4 bytes (if not less) just to note that the application wants to see
the file as truncated (1 bit) and that the next write has to be atomic
(a 2nd bit?).

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl


Re: Atomic file data replace API

2011-01-07 Thread Massimo Maggi
Are you suggesting to do:
1) open with O_TRUNC | O_ATOMIC: returns an fd to a temporary file;
2) the application writes to that fd, with one or more system calls,
over a short or a long time, at its will;
3) at close (or even at fsync) atomically swap the data pointer of the
real file with the temp file, then delete the temp file, transparently
to userland (something similar to e4defrag).
Is this summary correct?

Massimo Maggi

Il 07/01/2011 16:17, Olaf van der Spek ha scritto:
 On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason chris.ma...@oracle.com wrote:
 That's not what I asked. ;)
 I asked to wait until the first write (or close). That way, you don't
 get unintentional empty files.
 One step further, you don't have to keep the data in memory, you're
 free to write them to disk. You just wouldn't update the meta-data
 (yet).
 Sorry ;) Picture an application that truncates 1024 files without closing any
 of them.  Basically any operation that includes the kernel waiting for
 applications because they promise to do something soon is a denial of
 service attack, or a really easy way to run out of memory on the box.
 I'm not sure why you would run out of memory in that case.

 O_ATOMIC would be the solution for the rename workaround: write temp
 file, rename
 With advantages like a way simpler API, no issues with resetting
 meta-data, no issues with temp file and maybe better performance.

 Olaf



Re: Atomic file data replace API

2011-01-07 Thread Olaf van der Spek
On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi mass...@.it wrote:
 Are you suggesting to do:
 1)fopen with O_TRUNC, O_ATOMIC: returns fd to a temporary file
 2)application writes to that fd, with one or more system calls, in a
 short time or in long time, at his will.
 3)at fclose (or even at fsync ) atomically swap data pointer of real
 file with temp file, then delete temp.In a transparent mode to
 userland.  (something similar to e4defrag).
 Is this sum up correct?

Almost. Swap should probably not be done at fsync time.
Other open references (for example running executables) should be swapped too.

The new-file case has to be handled too.

Olaf


Re: hunt for 2.6.37 dm-crypt+ext4 corruption?

2011-01-07 Thread Matt
On Thu, Jan 6, 2011 at 4:56 PM, Heinz Diehl h...@fancy-poultry.org wrote:
 On 05.12.2010, Milan Broz wrote:

 It still seems to me like dmcrypt with its parallel processing is just
 a trigger for another bug in 37-rc.

 To come back to this: my 3 systems (XFS filesystem) running the latest
 dm-crypt-scale-to-multiple-cpus patch from Andi Kleen/Milan Broz have
 not shown a single problem since 2.6.37-rc6 and above. No corruption any
 longer, no freezes, nothing. The patch applies cleanly to 2.6.37, too,
 and runs just fine.

 I blindly guess that my data corruption problem was related to something else 
 in the
 2.6.37-rc series up to -rc4/5.

 Since this patch is a significant improvement: any chance that it finally gets
 merged into mainline/stable?



Hi Heinz,

I've been using this patch since 2.6.37-rc6+ with ext4 and xfs
filesystems and haven't seen any corruptions since then
(ext4 got fixed since 2.6.37-rc6, xfs showed no problems from the start)

http://git.eu.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1449032be17abb69116dbc393f67ceb8bd034f92
(is the actual temporary fix for ext4)

Regards

Matt


Various Questions

2011-01-07 Thread Carl Cook

On Fri 07 January 2011 08:14:17 Hubert Kario wrote:
 I'd suggest at least 
 mkfs.btrfs -m raid1 -d raid0 /dev/sdc /dev/sdd
 if you really want raid0

I don't fully understand -m or -d.  Why would this make a truer raid0 than 
with no options?


Is it necessary to use fdisk on new drives in creating a BTRFS multi-drive 
array?  Or is this all that's needed:
# mkfs.btrfs /dev/sdb /dev/sdc
# btrfs filesystem show

Is this related to 'subvolumes'?  The FAQ implies that a subvolume is like a 
directory, but also like a partition.  What's the rationale for being able to 
create a subvolume under a subvolume?  Hubert says he does this so he can use 
the shadow_copy module for samba to publish the snapshots to windows clients.  
I don't have any windows clients, but what difference does his structure make?

I know that when using SATA+LVM you should turn off the writeback cache on the 
drive, as it doesn't do cache flushing, and ensure NCQ is on.  But does this 
also apply to a BTRFS array?  If so, is this done in rc.local with 
hdparm -I /dev/sdb
hdparm -I /dev/sdc


How do you know what options to rsync are on by default?  I can't find this 
anywhere.  For example, it seems to me that --perms -ogE  --hard-links and 
--delete-excluded should be on by default, for a true sync?

If using the  --numeric-ids switch for rsync, do you just have to manually make 
sure the IDs and usernames are the same on source and destination machines?

For files that fail to transfer, wouldn't it be wise to use  --partial-dir=DIR 
to at least recover part of lost files?

The rsync man page says that rsync uses ssh by default, but is that the case?  
I think -e may be related to engaging ssh, but don't understand the explanation.

So for my system where there is a backup server, I guess I run the rsync daemon 
on the backup server which presents a port, then when the other systems decide 
it's time for a backup (cron) they:
- stop mysql, dump the database somewhere, start mysql;
- connect to the backup server's rsync port and dump their data to (hopefully) 
some specific place there.
Right?






Re: Various Questions

2011-01-07 Thread C Anthony Risinger
On Fri, Jan 7, 2011 at 11:15 AM, Carl Cook cac...@quantum-sci.com wrote:

 On Fri 07 January 2011 08:14:17 Hubert Kario wrote:
 I'd suggest at least
 mkfs.btrfs -m raid1 -d raid0 /dev/sdc /dev/sdd
 if you really want raid0

 I don't fully understand -m or -d.  Why would this make a truer raid0 than 
 with no options?

this will give you RAID0 for your data, but RAID1 for your metadata,
making it less likely that the FS itself gets corrupted, even though
you will lose some data in crash cases, if i understand correctly.

 Is it necessary to use fdisk on new drives in creating a BTRFS multi-drive 
 array?  Or is this all that's needed:
 # mkfs.btrfs /dev/sdb /dev/sdc
 # btrfs filesystem show

depends on whether you need /boot partitions or other partitions.
what you have works fine though.

 Is this related to 'subvolumes'?  The FAQ implies that a subvolume is like a 
 directory, but also like a partition.  What's the rationale for being able to 
 create a subvolume under a subvolume, as Hubert says so he can use the 
 shadow_copy module for samba to publish the snapshots  to windows clients.  
 I don't have any windows clients, but what difference does his structure make?

just his preference to put it there... the snapshot of a snapshot can
go anywhere.  it doesn't have to reside under its parent, the
parent was just used as a base, it's not bound to it in any way AFAIK.

 How do you know what options to rsync are on by default?  I can't find this 
 anywhere.  For example, it seems to me that --perms -ogE  --hard-links and 
 --delete-excluded should be on by default, for a true sync?

the links and command Freddie Cash posted are a really good base to work from.

 So for my system where there is a backup server, I guess I run the rsync 
 daemon on the backup server which presents a port, then when the other 
 systems decide it's time for a backup (cron) they:
 - stop mysql, dump the database somewhere, start mysql;
 - connect to the backup server's rsync port and dump their data to 
 (hopefully) some specific place there.
 Right?

you don't have to stop mysql, you just need to freeze any new,
incoming writes, and flush (ie. let finish) whatever is happening
right now.  this ensures mysql is _internally_ consistent on the disk.

see comment by Lloyd Standish here:

http://dev.mysql.com/doc/refman/5.1/en/backup-methods.html

C Anthony
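To make the freeze-and-flush point concrete, here is one hedged sketch of the dump step -- the output path and filename are invented, and the script is only written out and syntax-checked here, since actually running it needs a live MySQL server:

```shell
#!/bin/sh
# Dump MySQL without stopping the server.  --single-transaction takes a
# consistent InnoDB snapshot without blocking writers; MyISAM tables would
# need --lock-all-tables instead.  Paths below are examples.
cat > /tmp/mysql-backup.sh <<'EOF'
#!/bin/sh
set -e
mysqldump --single-transaction --flush-logs --all-databases |
    gzip > /var/backups/mysql-$(date +%F).sql.gz
EOF
sh -n /tmp/mysql-backup.sh && echo "mysql-backup.sh: syntax OK"
```

A cron job running that script gives the "internally consistent on the disk" state without the stop/start dance from the original plan.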


Re: Various Questions

2011-01-07 Thread Freddie Cash
On Fri, Jan 7, 2011 at 9:15 AM, Carl Cook cac...@quantum-sci.com wrote:
 How do you know what options to rsync are on by default?  I can't find this 
 anywhere.  For example, it seems to me that --perms -ogE  --hard-links and 
 --delete-excluded should be on by default, for a true sync?

Who cares which ones are on by default?  List the ones you want to use
on the command-line, every time.  That way, if the defaults change,
your setup won't.

 If using the  --numeric-ids switch for rsync, do you just have to manually 
 make sure the IDs and usernames are the same on source and destination 
 machines?

You use the --numeric-ids switch so that it *doesn't* matter if the
IDs/usernames are the same.  It just sends the ID number on the wire.
Sure, if you do an ls on the backup box, the username will appear to
be messed up.  But if you compare the user ID assigned to the file
with the user ID in the backed-up etc/passwd file, they are correct.
Then, if you ever need to restore the HTPC from backups, the
etc/passwd file is transferred over, the user IDs are transferred
over, and when you do an ls on the HTPC, everything matches up
correctly.

 For files that fail to transfer, wouldn't it be wise to use  
 --partial-dir=DIR to at least recover part of lost files?

Or, just run rsync again, if the connection is dropped.

 The rsync man page says that rsync uses ssh by default, but is that the case? 
  I think -e may be related to engaging ssh, but don't understand the 
 explanation.

Does it matter what the default is, if you specify exactly how you
want it to work on the command-line?

 So for my system where there is a backup server, I guess I run the rsync 
 daemon on the backup server which presents a port, then when the other 
 systems decide it's time for a backup (cron) they:
 - stop mysql, dump the database somewhere, start mysql;
 - connect to the backup server's rsync port and dump their data to 
 (hopefully) some specific place there.
 Right?

That's one way (push backups).  It works ok for small numbers of
systems being backed up.  But get above a handful of machines, and it
gets very hard to time everything so that you don't hammer the disks
on the backup server.

Pull backups (backups server does everything) works better, in my
experience.  Then you just script things up once, run 1 script, worry
about 1 schedule, and everything is stored on the backups server.  No
need to run rsync daemons everywhere, just run the rsync client, using
-e ssh, and let it do everything.

If you need it to run a script on the remote machine first, that's
easy enough to do:
  - ssh to remote system, run script to stop DBs, dump DBs, snapshot
FS, whatever
  - then run rsync
  - ssh to remote system run script to start DBs, delete snapshot, whatever

You're starting to over-think things.  Keep it simple, don't worry
about defaults, specify everything you want to do, and do it all from
the backups box.

-- 
Freddie Cash
fjwc...@gmail.com
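The pre-hook / rsync / post-hook sequence above, strung together as a single script driven from the backups box.  The host "htpc", the hook script paths, and the backup paths are all invented examples; the script is written to a file and only syntax-checked, since it targets a remote machine:

```shell
#!/bin/sh
# Sketch of a pull backup: run everything from the backup server over ssh.
cat > /tmp/pull-htpc.sh <<'EOF'
#!/bin/sh
set -e
ssh root@htpc 'sh /root/pre-backup.sh'    # stop/dump DBs, take snapshot
rsync -a --hard-links --numeric-ids --delete -e ssh \
    root@htpc:/home/ /media/backups/htpc/home/
ssh root@htpc 'sh /root/post-backup.sh'   # restart DBs, drop snapshot
EOF
sh -n /tmp/pull-htpc.sh && echo "pull-htpc.sh: syntax OK"
```

One script, one schedule, and nothing but stock rsync and sshd needed on the machines being backed up.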


Re: Various Questions

2011-01-07 Thread Carl Cook

Wow, this rsync and backup system is pretty amazing.  I've always just tarred 
each directory manually, but now find I can RELIABLY automate backups, and have 
SOLID versioning to boot.  Thanks to everyone who advised, especially Freddie 
and Anthony.

I am still waiting for hardware for my backup server, but have been preparing.  
On the backup server I'll be doing pull backups for everything except my phone 
(which is connected intermittently).  I'm going to set up a cron script on the 
backup server to pull backups once a week (as opposed to once/mo which I've 
done for 12 years).  I am at a loss how to lock the database on the HTPC 
while exporting the dump, as per Lloyd Standish, but will study it.  (Freddie 
gave a nice script, but it doesn't seem to lock/flush first)  Also don't know 
how to email results/success/fail on completion, as I've not a very good coder.

But here is my proposed cron:
btrfs subvolume snapshot hex:///home /media/backups/snapshots/hex-{DATE}
rsync --archive --hard-links --delete-during --delete-excluded --inplace 
--numeric-ids -e ssh --exclude-from=/media/backups/exclude-hex hex:///home 
/media/backups/hex
btrfs subvolume snapshot droog:///home /media/backups/snapshots/droog-{DATE}
rsync --archive --hard-links --delete-during --delete-excluded --inplace 
--numeric-ids -e ssh --exclude-from=/media/backups/exclude-droog droog:///home 
/media/backups/droog

My root filesystems are ext4, so I guess they cannot be snapshotted before 
backup.  My home directories are/will be BTRFS though.


On Fri 07 January 2011 08:14:17 Hubert Kario wrote:
 I'd suggest at least 
 mkfs.btrfs -m raid1 -d raid0 /dev/sdc /dev/sdd
 if you really want raid0

 I don't fully understand -m or -d.  Why would this make a truer raid0 than 
 with no options?

I am beginning to suspect that this is the -default- behavior, as described in 
the wiki:
# Create a filesystem across four drives (metadata mirrored, data striped)

Should I turn off the writeback cache on each drive when running BTRFS?



open_ctree failed, unable to mount the fs

2011-01-07 Thread Tomasz Chmielewski
I got a power cycle, after which I'm no longer able to mount btrfs 
filesystem:



device fsid x-y devid 1 transid 169686 /dev/vda3
device fsid x-y devid 1 transid 169686 /dev/vda3
parent transid verify failed on 3260289024 wanted 169686 found 169685
parent transid verify failed on 3260289024 wanted 169686 found 169685
parent transid verify failed on 3260289024 wanted 169686 found 169685
btrfs: open_ctree failed


Tried to get that mounted with 2.6.35 and 2.6.37, without success.

Is there a way to fix it?


--
Tomasz Chmielewski
http://wpkg.org


Re: open_ctree failed, unable to mount the fs

2011-01-07 Thread Hugo Mills
On Fri, Jan 07, 2011 at 08:01:47PM +0100, Tomasz Chmielewski wrote:
 I got a power cycle, after which I'm no longer able to mount btrfs
 filesystem:
 
 
 device fsid x-y devid 1 transid 169686 /dev/vda3
 device fsid x-y devid 1 transid 169686 /dev/vda3
 parent transid verify failed on 3260289024 wanted 169686 found 169685
 parent transid verify failed on 3260289024 wanted 169686 found 169685
 parent transid verify failed on 3260289024 wanted 169686 found 169685
 btrfs: open_ctree failed
 
 
 Tried to get that mounted with 2.6.35 and 2.6.37, without success.
 
 Is there a way to fix it?

   The forthcoming[1] btrfsck tool should handle that particular
error, I believe.

   To prevent it from happening again, ensure that you have working
barriers on your disks, or that you turn off write caching on the
drives at every boot.
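A hedged sketch of the second option, disabling the on-drive write cache.  The device names are examples and the hdparm calls are only echoed here, since applying them needs root and real drives -- drop the echo (e.g. in rc.local) to apply for real:

```shell
#!/bin/sh
# Print the commands an rc.local fragment could run to disable the
# on-drive write cache (-W 0) on each member disk of the array.
for dev in /dev/sdb /dev/sdc; do
    echo hdparm -W 0 "$dev"
done
```

Note that hdparm's -I flag only *reports* drive capabilities; -W is the flag that actually toggles the write cache.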

   Hugo.

[1] out real soon now

-- 
=== Hugo Mills: h...@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Well, sir, the floor is yours.  But remember, the ---
  roof is ours!  




Re: open_ctree failed, unable to mount the fs

2011-01-07 Thread Tomasz Chmielewski
 The forthcoming[1] btrfsck tool should handle that particular
 error, I believe.

I noticed a similar problem was discussed here, with a solution:

http://www.spinics.net/lists/linux-btrfs/msg07572.html


where a btrfs-selects-super was used:

 git clone
 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
 next
 
 (or git pull into your existing checkout)
 
 Then
 
 make btrfs-select-super
 ./btrfs-selects-super -s 1 /dev/xxx
 
 After this you'll want to do a full backup and make sure things are
 working properly.


However, I don't see the tool when I clone the latest git - am I missing 
something?


FYI, btrfsck -s 1 /dev/vda3 says:

using SB copy 1, bytenr 67108864
fs tree 265 refs 1 not found
unresolved ref root 284 dir 115450 index 2 namelen 19 name 
2010-10-17-05:15:23 error 600
fs tree 274 refs 1 not found
unresolved ref root 284 dir 115450 index 3 namelen 19 name 
2010-10-24-05:15:20 error 600
fs tree 277 refs 1 not found
unresolved ref root 284 dir 115392 index 20 namelen 19 name 
2010-10-26-05:15:21 error 600
fs tree 278 refs 1 not found
unresolved ref root 284 dir 115392 index 21 namelen 19 name 
2010-10-27-05:15:23 error 600
fs tree 279 refs 1 not found
unresolved ref root 284 dir 115392 index 22 namelen 19 name 
2010-10-28-05:18:28 error 600
fs tree 280 refs 1 not found
unresolved ref root 284 dir 115392 index 23 namelen 19 name 
2010-10-29-05:15:21 error 600
fs tree 281 refs 1 not found
unresolved ref root 284 dir 115392 index 24 namelen 19 name 
2010-10-30-05:15:27 error 600
fs tree 282 refs 1 not found
unresolved ref root 284 dir 115450 index 4 namelen 19 name 
2010-10-31-05:15:21 error 600
fs tree 283 refs 1 not found
unresolved ref root 284 dir 115392 index 25 namelen 19 name 
2010-10-31-05:15:21 error 600
fs tree 284 refs 13 
unresolved ref root 284 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 319 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 340 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 348 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 355 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 357 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 358 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 359 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 360 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 361 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 362 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
unresolved ref root 363 dir 115453 index 2 namelen 19 name 
2010-11-01-05:15:26 error 600
fs tree 299 refs 1 not found
unresolved ref root 319 dir 115450 index 6 namelen 19 name 
2010-11-14-05:15:25 error 600
fs tree 307 refs 1 not found
unresolved ref root 319 dir 115450 index 7 namelen 19 name 
2010-11-21-05:18:23 error 600
fs tree 312 refs 1 not found
unresolved ref root 319 dir 115392 index 50 namelen 19 name 
2010-11-25-05:15:25 error 600
fs tree 313 refs 1 not found
unresolved ref root 319 dir 115392 index 51 namelen 19 name 
2010-11-26-05:15:24 error 600
fs tree 314 refs 1 not found
unresolved ref root 319 dir 115392 index 52 namelen 19 name 
2010-11-27-05:15:27 error 600
fs tree 315 refs 2 not found
unresolved ref root 319 dir 115450 index 8 namelen 19 name 
2010-11-28-05:15:22 error 600
unresolved ref root 340 dir 115450 index 8 namelen 19 name 
2010-11-28-05:15:22 error 600
fs tree 316 refs 1 not found
unresolved ref root 319 dir 115392 index 53 namelen 19 name 
2010-11-28-05:15:22 error 600
fs tree 317 refs 1 not found
unresolved ref root 319 dir 115392 index 54 namelen 19 name 
2010-11-29-05:15:25 error 600
fs tree 318 refs 1 not found
unresolved ref root 319 dir 115392 index 55 namelen 19 name 
2010-11-30-05:15:24 error 600
fs tree 319 refs 12 
unresolved ref root 319 dir 115453 index 3 namelen 19 name 
2010-12-01-05:15:29 error 600
unresolved ref root 340 dir 115453 index 3 namelen 19 name 
2010-12-01-05:15:29 error 600
unresolved ref root 348 dir 115453 index 3 namelen 19 name 
2010-12-01-05:15:29 error 600
unresolved ref root 355 dir 115453 index 3 namelen 19 name 
2010-12-01-05:15:29 error 600
unresolved ref root 357 dir 115453 index 3 namelen 19 name 
2010-12-01-05:15:29 error 600
unresolved ref root 358 dir 115453 index 3 namelen 19 name 
2010-12-01-05:15:29 error 600
unresolved ref root 359 dir 115453 index 3 namelen 19 name 
2010-12-01-05:15:29 

Re: Atomic file data replace API

2011-01-07 Thread Chris Mason
Excerpts from Hubert Kario's message of 2011-01-07 11:26:02 -0500:
 On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
  Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
   On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason chris.ma...@oracle.com 
 wrote:
That's not what I asked. ;)
I asked to wait until the first write (or close). That way, you don't
get unintentional empty files.
One step further, you don't have to keep the data in memory, you're
free to write them to disk. You just wouldn't update the meta-data
(yet).

Sorry ;) Picture an application that truncates 1024 files without
closing any of them.  Basically any operation that includes the kernel
waiting for applications because they promise to do something soon is
a denial of service attack, or a really easy way to run out of memory
on the box.
   
   I'm not sure why you would run out of memory in that case.
  
  Well, lets make sure I've got a good handle on the proposed interface:
  
  1) fd = open(some_file, O_ATOMIC)
  2) truncate(fd, 0)
  3) write(fd, new data)
  
  The semantics are that we promise not to let the truncate hit the disk
  until the application does the write.
  
  We have a few choices on how we do this:
  
  1) Leave the disk untouched, but keep something in memory that says this
  inode is really truncated
  
  2) Record on disk that we've done our atomic truncate but it is still
  pending.  We'd need some way to remove or invalidate this record after a
  crash.
  
  3) Go ahead and do the operation but don't allow the transaction to
  commit until the write is done.
  
  option #1: keep something in memory.  Well, any time we have a
  requirement to pin something in memory until userland decides to do a
  write, we risk oom.
 
 Userland has already a file descriptor allocated (which can fail anyway 
 because of OOM), I see no problem in increasing the size of kernel memory 
 usage by 4 bytes (if not less) just to note that the application wants to see 
 the file as truncated (1 bit) and the next write has to be atomic (2nd bit?).
 

The exact amount of tracking is going to vary.  The reason why is that
actually doing the truncate is an O(size of the file) operation and so
you can't just flip a switch when the write or the close comes in.  You
have to run through all the metadata of the file and do something
temporary with each part that is only completed when the file IO is
actually done.

Honestly, there are many different ways to solve this in the application.
Requiring high speed atomic replacement of individual file contents is a
recipe for frustration.

-chris


Re: open_ctree failed, unable to mount the fs

2011-01-07 Thread cwillu
On Fri, Jan 7, 2011 at 1:25 PM, Tomasz Chmielewski man...@wpkg.org wrote:
 The forthcoming[1] btrfsck tool should handle that particular
 error, I believe.

 I noticed a similar problem was discussed here, with a solution:

 http://www.spinics.net/lists/linux-btrfs/msg07572.html


 where a btrfs-selects-super was used:

 git clone
 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
 next

 (or git pull into your existing checkout)

 Then

 make btrfs-select-super
 ./btrfs-selects-super -s 1 /dev/xxx

 After this you'll want to do a full backup and make sure things are
 working properly.


 However, I don't see the tool when I clone the latest git - am I missing 
 something?

It's not built by the makefile by default;  make btrfs-select-super
as stated above will make it.


Re: Atomic file data replace API

2011-01-07 Thread Thomas Bellman

Olaf van der Spek wrote:


On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi mass...@.it wrote:

Are you suggesting to do:
1)fopen with O_TRUNC, O_ATOMIC: returns fd to a temporary file
2)application writes to that fd, with one or more system calls, in a
short time or in long time, at his will.
3)at fclose (or even at fsync ) atomically swap data pointer of real
file with temp file, then delete temp. In a transparent mode to
userland.  (something similar to e4defrag).
Is this sum up correct?


Almost. Swap should probably not be done at fsync time.
Other open references (for example running executables) should be swapped too.


What is the visibility of the changes for other processes supposed
to be in the meantime?  I.e., if things happen in this order:

1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
2. Process B does fdb = open("foo.txt", O_RDONLY)
3. B does read(fdb, buf, 4096)
4. A does write(fda, "NEW DATA\n", 9)
5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
6. C does read(fdc, buf, 4096)
7. A calls close(fda)

Does B see an empty file, or does it see the old contents of
the file?  Does C see "NEW DATA\n", or does it see the old
contents of the file, or perhaps an empty file?


/Bellman


Re: open_ctree failed, unable to mount the fs

2011-01-07 Thread Tomasz Chmielewski

On 07.01.2011 20:46, cwillu wrote:


However, I don't see the tool when I clone the latest git - am I missing 
something?


It's not built by the makefile by default;  make btrfs-select-super
as stated above will make it.


$ grep select Makefile
$ grep super Makefile
$ grep -r select-super *
$ grep -r selects-super *


I used:

git clone 
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git 
next



--
Tomasz Chmielewski
http://wpkg.org


Re: 'open_ctree failed', unable to mount the fs

2011-01-07 Thread Ken D'Ambrosio
On Fri, January 7, 2011 2:09 pm, Hugo Mills wrote:
 On Fri, Jan 07, 2011 at 08:01:47PM +0100, Tomasz Chmielewski wrote:

 I got a power cycle, after which I'm no longer able to mount btrfs
 filesystem:
[...]
 The forthcoming[1] btrfsck tool should handle that particular
 error, I believe.

I tried it with the btrfsck in the git repo (last week), and wound up
with... a brand-new, blank btrfs partition.  Not *quite* what I was
looking for.  But at least it mounted.

-K






Re: open_ctree failed, unable to mount the fs

2011-01-07 Thread cwillu
On Fri, Jan 7, 2011 at 2:01 PM, Tomasz Chmielewski man...@wpkg.org wrote:
 On 07.01.2011 20:46, cwillu wrote:

 However, I don't see the tool when I clone the latest git - am I missing
 something?

 It's not built by the makefile by default;  make btrfs-select-super
 as stated above will make it.

 $ grep select Makefile
 $ grep super Makefile
 $ grep -r select-super *
 $ grep -r selects-super *


 I used:

 git clone
 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
 next

You checked out the master branch into a folder called "next".  "-b next"
is the option to checkout a specific branch.  From your existing
checkout, "git checkout -t origin/next" will switch to that branch.
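For reference, the difference -b makes can be seen with a throwaway local repository -- directory names here are invented, and the kernel.org URL from the thread behaves the same way:

```shell
#!/bin/sh
# git clone -b <branch> checks out that branch; a plain clone gives the
# repository's default branch.  Demonstrated on a scratch repo.
set -e
d=$(mktemp -d) && cd "$d"
git init -q src
git -C src -c user.email=a@b -c user.name=a commit -q --allow-empty -m base
git -C src branch next
git clone -q -b next src with-b            # like: git clone -b next <url> next
git -C with-b rev-parse --abbrev-ref HEAD  # prints: next
```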


Re: 'open_ctree failed', unable to mount the fs

2011-01-07 Thread cwillu
On Fri, Jan 7, 2011 at 2:02 PM, Ken D'Ambrosio k...@jots.org wrote:
 On Fri, January 7, 2011 2:09 pm, Hugo Mills wrote:
 On Fri, Jan 07, 2011 at 08:01:47PM +0100, Tomasz Chmielewski wrote:

 I got a power cycle, after which I'm no longer able to mount btrfs
 filesystem:
 [...]
 The forthcoming[1] btrfsck tool should handle that particular
 error, I believe.

 I tried it with the btrfsck in the git repo (last week), and wound up
 with... a brandy-new, blank btrfs partition.  Not *quite* what I was
 looking for.  But at least it mounted.

btrfsck in git hasn't been updated since October; the upcoming fsck
work isn't public, presumably to avoid making things worse until it's
working right.  The current btrfsck doesn't write to the partition as
far as I'm aware.


Re: open_ctree failed, unable to mount the fs

2011-01-07 Thread Tomasz Chmielewski

On 07.01.2011 21:18, cwillu wrote:


You checked out the master branch into a folder called next.  -b
next is the option to checkout a specific branch.  From your existing
checkout, git checkout -t origin/next will switch to that branch.


Good catch - thanks for a hint.

The filesystem mounted and was usable.

--
Tomasz Chmielewski
http://wpkg.org


btrfsck segmentation fault

2011-01-07 Thread Andrew Schretter
I have a 10TB btrfs filesystem over iSCSI that is currently unmountable.  I'm
currently running Fedora 13 with a recent Fedora 14 kernel 
(2.6.35.9-64.fc14.i686.PAE)
and the system hung with messages like :

parent transid verify failed on 5937615339520 wanted 48547 found 48542

I've rebooted and am attempting to recover with btrfsck from the 
btrfs-progs-unstable
git tree, but it is segfaulting after finding a superblock and listing out 3 of 
the
parent transid messages.  Anyone have any ideas?

I tried btrfsck /dev/sdb, btrfsck -s 1 /dev/sdb, and btrfsck -s 2 /dev/sdb with 
the
same result for each.  The btrfsck binary I compiled does work on a small 
(800MB) test
btrfs file system.  I suspect it may be due to the size of the filesystem I am 
trying
to repair.

Running btrfsck with gdb returns :
#0  find_first_block_group (root=0x8067178, path=0x80677f8, key=0xb24b) at 
extent-tree.c:3028
#1  0x08055603 in btrfs_read_block_groups (root=0x8067178) at extent-tree.c:3072
#2  0x08053009 in open_ctree_fd (fp=7, path=0xb63a /dev/sdb, 
sb_bytenr=value optimized out, writes=0) at disk-io.c:760
#3  0x080530e8 in open_ctree (filename=0xb63a /dev/sdb, sb_bytenr=0, 
writes=0) at disk-io.c:587
#4  0x0804d3fc in main (ac=value optimized out, av=Cannot access memory at 
address 0x4

In any event, recovering the data would be nice and any ideas to do so would be 
appreciated.

-- 

Andrew Schretter
Systems Programmer, Duke University
Dept. of Mathematics (919) 660-2866



Re: btrfsck segmentation fault

2011-01-07 Thread cwillu
On Fri, Jan 7, 2011 at 3:15 PM, Andrew Schretter schr...@math.duke.edu wrote:
 I have a 10TB btrfs filesystem over iSCSI that is currently unmountable.  I'm
 currently running Fedora 13 with a recent Fedora 14 kernel 
 (2.6.35.9-64.fc14.i686.PAE)
 and the system hung with messages like :

 parent transid verify failed on 5937615339520 wanted 48547 found 48542

 I've rebooted and am attempting to recover with btrfsck from the 
 btrfs-progs-unstable
 git tree, but it is segfaulting after finding a superblock and listing out 3 
 of the
 parent transid messages.  Anyone have any ideas?

 I tried btrfsck /dev/sdb, btrfsck -s 1 /dev/sdb, and btrfsck -s 2 /dev/sdb 
 with the
 same result for each.  The btrfsck binary I compiled does work on a small 
 (800MB) test
 btrfs file system.  I suspect it may be due to the size of the filesystem I 
 am trying
 to repair.

Segfaulting is what the current btrfsck does when it finds a problem;
it doesn't try to fix anything yet.


Re: Offline Deduplication for Btrfs

2011-01-07 Thread Peter A
On Thursday, January 06, 2011 01:35:15 pm Chris Mason wrote:
 What is the smallest granularity that the datadomain searches for in
 terms of dedup?
 
 Josef's current setup isn't restricted to a specific block size, but
 there is a min match of 4k.
I talked to a few people I know and didn't get a clear answer either... 
However, 512 bytes came up more than once. 

I'm not really worried about the size of region to be used, but about 
offsetting it... it's so easy to create large tars, ... where the content is 
offset by a few bytes, multiples of 512 and such.

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.


Re: Atomic file data replace API

2011-01-07 Thread Phillip Susi

On 01/07/2011 09:58 AM, Chris Mason wrote:

Yes and no.  We have a best effort mechanism where we try to guess that
since you've done this truncate and the write that you want the writes
to show up quickly.  But its a guess.


It is a pretty good guess, and one that the NT kernel has been making 
for 15 years or so.  I've been following this issue for some time and I 
still don't understand why Ted is so hostile to this and can't make it 
work right on ext4.  When you get a rename() you just need to check if 
there are outstanding journal transactions and/or dirty cache pages, and 
hang the rename() transaction on the end of those.  That way if the 
system crashes after the new file has fully hit the disk, the old file 
is gone and you only have the new one, but if it crashes before, you 
still have the old one in place.


Both the writes and the rename can be delayed in the cache to an 
arbitrary point in the future; what matters is that their order is 
preserved.
