[patch] btrfs: use add_to_page_cache_lru, use __page_cache_alloc

2010-03-17 Thread Nick Piggin
btrfs: use add_to_page_cache_lru, use __page_cache_alloc

Pagecache pages should be allocated with __page_cache_alloc, so they
obey pagecache memory policies.

add_to_page_cache_lru is exported, so it should be used. Benefits over
using a private pagevec: neater code, 128 bytes fewer stack used, percpu
lru ordering is preserved, and finally don't need to flush pagevec
before returning so batching may be shared with other LRU insertions.

Signed-off-by: Nick Piggin <npig...@suse.de>
---
 fs/btrfs/compression.c |   20 ++--
 fs/btrfs/extent_io.c   |   22 +-
 2 files changed, 7 insertions(+), 35 deletions(-)

Index: linux-2.6/fs/btrfs/compression.c
===================================================================
--- linux-2.6.orig/fs/btrfs/compression.c
+++ linux-2.6/fs/btrfs/compression.c
@@ -31,7 +31,6 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/bit_spinlock.h>
-#include <linux/pagevec.h>
 #include "compat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -445,7 +444,6 @@ static noinline int add_ra_bio_pages(str
 	unsigned long nr_pages = 0;
 	struct extent_map *em;
 	struct address_space *mapping = inode->i_mapping;
-	struct pagevec pvec;
 	struct extent_map_tree *em_tree;
 	struct extent_io_tree *tree;
 	u64 end;
@@ -461,7 +459,6 @@ static noinline int add_ra_bio_pages(str
 
 	end_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;
 
-	pagevec_init(&pvec, 0);
 	while (last_offset < compressed_end) {
 		page_index = last_offset >> PAGE_CACHE_SHIFT;
 
@@ -478,26 +475,15 @@ static noinline int add_ra_bio_pages(str
 			goto next;
 		}
 
-		page = alloc_page(mapping_gfp_mask(mapping) & ~__GFP_FS);
+		page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
 		if (!page)
 			break;
 
-		page->index = page_index;
-		/*
-		 * what we want to do here is call add_to_page_cache_lru,
-		 * but that isn't exported, so we reproduce it here
-		 */
-		if (add_to_page_cache(page, mapping,
-				page->index, GFP_NOFS)) {
+		if (add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 			page_cache_release(page);
 			goto next;
 		}
 
-		/* open coding of lru_cache_add, also not exported */
-		page_cache_get(page);
-		if (!pagevec_add(&pvec, page))
-			__pagevec_lru_add_file(&pvec);
-
 		end = last_offset + PAGE_CACHE_SIZE - 1;
 		/*
 		 * at this point, we have a locked page in the page cache
@@ -551,8 +537,6 @@ static noinline int add_ra_bio_pages(str
 next:
 		last_offset += PAGE_CACHE_SIZE;
 	}
-	if (pagevec_count(&pvec))
-		__pagevec_lru_add_file(&pvec);
 	return 0;
 }
 
Index: linux-2.6/fs/btrfs/extent_io.c
===================================================================
--- linux-2.6.orig/fs/btrfs/extent_io.c
+++ linux-2.6/fs/btrfs/extent_io.c
@@ -2663,33 +2663,21 @@ int extent_readpages(struct extent_io_tr
 {
 	struct bio *bio = NULL;
 	unsigned page_idx;
-	struct pagevec pvec;
 	unsigned long bio_flags = 0;
 
-	pagevec_init(&pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		/*
-		 * what we want to do here is call add_to_page_cache_lru,
-		 * but that isn't exported, so we reproduce it here
-		 */
-		if (!add_to_page_cache(page, mapping,
+		if (add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
-
-			/* open coding of lru_cache_add, also not exported */
-			page_cache_get(page);
-			if (!pagevec_add(&pvec, page))
-				__pagevec_lru_add_file(&pvec);
-			__extent_read_full_page(tree, page, get_extent,
-						&bio, 0, &bio_flags);
+			page_cache_release(page);
+			continue;
 		}
-		page_cache_release(page);
+		__extent_read_full_page(tree, page, get_extent,
+					&bio, 0, &bio_flags);
 	}
-	if (pagevec_count(&pvec))
-		__pagevec_lru_add_file(&pvec);
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		submit_one_bio(READ, bio, 0, bio_flags);

Re: Content based storage

2010-03-17 Thread David Brown

On 16/03/2010 23:45, Fabio wrote:

> Some years ago I was searching for that kind of functionality and found
> an experimental ext3 patch to allow the so-called COW-links:
> http://lwn.net/Articles/76616/



I'd read about the COW patches for ext3 before.  While there is 
certainly some similarity here, there are a fair number of differences. 
 One is that those patches were aimed only at copying - there was no 
way to merge files later.  Another is that it was (as far as I can see) 
just an experimental hack to try out the concept.  Since it didn't take 
off, I think it is worth learning from, but not building on.



> There was a discussion later on LWN (http://lwn.net/Articles/77972/):
> an approach like COW-links would break POSIX standards.



I think a lot of the problems here were concerning inode numbers.  As 
far as I understand it, when you made an ext3-cow copy, the copy and the 
original had different inode numbers.  That meant the userspace programs 
saw them as different files, and you could have different owners, 
attributes, etc., while keeping the data linked.  But that broke a 
common optimisation when doing large diffs - thus some people wanted to 
have the same inode for each file, and that /definitely/ broke POSIX.
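
To make that optimisation concrete (a minimal illustrative sketch, not code 
from this thread): a diff-like tool can skip reading two paths entirely when 
they resolve to the same device and inode, which is exactly the shortcut that 
per-copy inode numbers defeat.

/*
 * Illustration only: the stat()-based shortcut a diff-like tool can take.
 * Two paths sharing st_dev and st_ino are the same file, so their contents
 * never need to be read.  A COW copy that gets its own inode number can
 * never take this shortcut.
 */
#include <stdio.h>
#include <sys/stat.h>

static int same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) || stat(b, &sb))
		return 0;	/* on error, assume we must compare */
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s FILE1 FILE2\n", argv[0]);
		return 2;
	}
	if (same_file(argv[1], argv[2]))
		printf("same inode: contents need not be compared\n");
	else
		printf("different inodes: contents must be compared\n");
	return 0;
}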


With btrfs, the file copies would each have their own inode - it would, 
I think, be POSIX compliant as it is transparent to user programs.  The 
diff optimisation discussed in the articles you cited would not work - 
but if btrfs becomes the standard Linux file system, then user 
applications like diff can be extended with btrfs-specific optimisations 
if necessary.



> I am not very technical and don't know if it's feasible in btrfs.


Nor am I very knowledgeable in this area (most of my programming is on 
8-bit processors), but I believe btrfs is already designed to support 
larger checksums (32-bit CRCs are not enough to say that data is 
identical), and cp --reflink shows how the underlying link is made.



> I think most likely you'll have to run a userspace tool to find and
> merge identical files based on checksums (which already sounds good to me).


This sounds right to me.  In fact, it would be possible to do today, 
entirely from within user space - but files would need to be compared 
long-hand before merging.  With larger checksums, the userspace daemon 
would be much more efficient.
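
As a rough sketch of what that userspace merge could look like today (an 
illustration, not code from this thread; it assumes both files sit on the 
same btrfs filesystem, and it defines the clone ioctl number by hand, so 
treat that as an assumption to check against the kernel headers): compare 
the two files byte for byte and, only if they match, turn the second into a 
reflink clone of the first -- the same operation cp --reflink performs.

/*
 * compare-then-reflink: a toy "merge" of two identical files on btrfs.
 * BTRFS_IOC_CLONE is defined locally as an assumption (0x94 is the btrfs
 * ioctl magic, 9 the clone command); verify against fs/btrfs/ioctl.h
 * before trusting it.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#ifndef BTRFS_IOC_CLONE
#define BTRFS_IOC_CLONE	_IOW(0x94, 9, int)
#endif

/* long-hand comparison: read both files and memcmp block by block */
static int same_contents(int fd1, int fd2)
{
	static char b1[65536], b2[65536];
	ssize_t n1, n2;

	for (;;) {
		n1 = read(fd1, b1, sizeof(b1));
		n2 = read(fd2, b2, sizeof(b2));
		if (n1 < 0 || n2 < 0 || n1 != n2)
			return 0;
		if (n1 == 0)
			return 1;	/* both hit EOF together: identical */
		if (memcmp(b1, b2, n1))
			return 0;
	}
}

int main(int argc, char **argv)
{
	int keep, merge;

	if (argc != 3) {
		fprintf(stderr, "usage: %s KEEP MERGE\n", argv[0]);
		return 2;
	}
	keep = open(argv[1], O_RDONLY);
	merge = open(argv[2], O_RDWR);
	if (keep < 0 || merge < 0) {
		perror("open");
		return 1;
	}
	if (!same_contents(keep, merge)) {
		fprintf(stderr, "files differ, not merging\n");
		return 1;
	}
	/* make MERGE share KEEP's extents instead of keeping its own copy */
	if (ioctl(merge, BTRFS_IOC_CLONE, keep)) {
		perror("BTRFS_IOC_CLONE");
		return 1;
	}
	printf("%s now shares disk space with %s\n", argv[2], argv[1]);
	return 0;
}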



> The only thing we can ask the developers at the moment is if something
> like that would be possible without changes to the on-disk format.



I guess that's partly why I made these posts!



> PS. Another great scenario is shared hosting web/file servers: tens of
> thousands of websites with mostly the same tiny PHP Joomla files.
> If you can get the benefits of: compression + content based/cowlinks +
> FS Cache... That would really make Btrfs FLY on Hard Disk and make SSD
> devices possible for storage (because of the space efficiency).



That's a good point.

People often think that hard disk space is cheap these days - but being 
space efficient means you can use an SSD instead of a hard disk.  And 
for on-disk backups, it means you can use a small number of disks even 
though the users think "I've got a huge hard disk, I can make lots of 
copies of these files"!




Re: Content based storage

2010-03-17 Thread David Brown

On 17/03/2010 01:45, Hubert Kario wrote:

On Tuesday 16 March 2010 10:21:43 David Brown wrote:

Hi,

I was wondering if there has been any thought or progress in
content-based storage for btrfs beyond the suggestion in the Project
ideas wiki page?

The basic idea, as I understand it, is that a longer data extent
checksum is used (long enough to make collisions unrealistic), and merge
data extents with the same checksums.  The result is that cp foo bar
will have pretty much the same effect as cp --reflink foo bar - the
two copies will share COW data extents - as long as they remain the
same, they will share the disk space.  But you can still access each
file independently, unlike with a traditional hard link.

I can see at least three cases where this could be a big win - I'm sure
there are more.

Developers often have multiple copies of source code trees as branches,
snapshots, etc.  For larger projects (I have multiple buildroot trees
for one project) this can take a lot of space.  Content-based storage
would give the space efficiency of hard links with the independence of
straight copies.  Using cp --reflink would help for the initial
snapshot or branch, of course, but it could not help after the copy.

On servers using lightweight virtual servers such as OpenVZ, you have
multiple root file systems each with their own copy of /usr, etc.
With OpenVZ, all the virtual roots are part of the host's file system
(i.e., not hidden within virtual disks), so content-based storage could
merge these, making them very much more efficient.  Because each of
these virtual roots can be updated independently, it is not possible to
use cp --reflink to keep them merged.

For backup systems, you will often have multiple copies of the same
files.  A common scheme is to use rsync and cp -al to make hard-linked
(and therefore space-efficient) snapshots of the trees.  But sometimes
these things get out of synchronisation - perhaps your remote rsync dies
halfway, and you end up with multiple independent copies of the same
files.  Content-based storage can then re-merge these files.


I would imagine that content-based storage will sometimes be a
performance win, sometimes a loss.  It would be a win when merging
results in better use of the file system cache - OpenVZ virtual serving
would be an example where you would be using multiple copies of the same
file at the same time.  For other uses, such as backups, there would be
no performance gain since you seldom (hopefully!) read the backup files.
   But in that situation, speed is not a major issue.


mvh.,

David


From what I could read, content-based storage is supposed to be in-line
deduplication; there are already plans to do (probably) a userland daemon
traversing the FS and merging identical extents -- giving you post-process
deduplication.

For a rather heavily used host (such as a VM host) you'd probably want to use
post-process dedup -- as the daemon can be easily stopped or given lower
priority. In-line dedup is quite CPU intensive.

In-line dedup is very nice for backup though -- you don't need the temporary
storage before the (mostly unchanged) data is deduplicated.


I think post-process deduplication is the way to go here, using a 
userspace daemon.  It's the most flexible solution.  As you say, inline 
dedup could be nice in some cases, such as for backups, since the cpu 
time cost is not an issue there.  However, in a typical backup 
situation, the new files are often written fairly slowly (for remote 
backups).  Even for local backups, there is generally not that much 
/new/ data, since you normally use some sort of incremental backup 
scheme (such as rsync, combined with cp -al or cp --reflink).  Thus it 
should be fine to copy over the data, then de-dup it later or in the 
background.




Re: Content based storage

2010-03-17 Thread Heinz-Josef Claes
Hi,

just want to add one correction to your thoughts:

Storage is not cheap if you think about enterprise storage on a SAN, 
replicated to another data centre. Using dedup on the storage boxes leads to 
performance issues and other problems - only NetApp is offering this at the 
moment and it's not heavily used (because of the issues).

So I think it would be a big advantage for professional use to have dedup 
built into the filesystem - processors are getting faster and faster today and 
are not the cost drivers any more. I do not think it's a problem to spend one 
core of a 2-socket box with 12 cores for this purpose.
Storage is cost intensive:
- SAN boxes are expensive
- RAID5 in two locations is expensive
- FC lines between locations are expensive (depending very much on where you 
are).

Naturally, you would not use this feature for all kinds of use cases (e.g. a 
heavily used database), but I think there is enough need.

my 2 cents,
Heinz-Josef Claes

On Wednesday 17 March 2010 09:27:15 you wrote:
 On 17/03/2010 01:45, Hubert Kario wrote:
  On Tuesday 16 March 2010 10:21:43 David Brown wrote:
  Hi,
  
  I was wondering if there has been any thought or progress in
  content-based storage for btrfs beyond the suggestion in the Project
  ideas wiki page?
  
  The basic idea, as I understand it, is that a longer data extent
  checksum is used (long enough to make collisions unrealistic), and merge
  data extents with the same checksums.  The result is that cp foo bar
  will have pretty much the same effect as cp --reflink foo bar - the
  two copies will share COW data extents - as long as they remain the
  same, they will share the disk space.  But you can still access each
  file independently, unlike with a traditional hard link.
  
  I can see at least three cases where this could be a big win - I'm sure
  there are more.
  
  Developers often have multiple copies of source code trees as branches,
  snapshots, etc.  For larger projects (I have multiple buildroot trees
  for one project) this can take a lot of space.  Content-based storage
  would give the space efficiency of hard links with the independence of
  straight copies.  Using cp --reflink would help for the initial
  snapshot or branch, of course, but it could not help after the copy.
  
  On servers using lightweight virtual servers such as OpenVZ, you have
  multiple root file systems each with their own copy of /usr, etc.
  With OpenVZ, all the virtual roots are part of the host's file system
  (i.e., not hidden within virtual disks), so content-based storage could
  merge these, making them very much more efficient.  Because each of
  these virtual roots can be updated independently, it is not possible to
  use cp --reflink to keep them merged.
  
  For backup systems, you will often have multiple copies of the same
  files.  A common scheme is to use rsync and cp -al to make hard-linked
  (and therefore space-efficient) snapshots of the trees.  But sometimes
  these things get out of synchronisation - perhaps your remote rsync dies
  halfway, and you end up with multiple independent copies of the same
  files.  Content-based storage can then re-merge these files.
  
  
  I would imagine that content-based storage will sometimes be a
  performance win, sometimes a loss.  It would be a win when merging
  results in better use of the file system cache - OpenVZ virtual serving
  would be an example where you would be using multiple copies of the same
  file at the same time.  For other uses, such as backups, there would be
  no performance gain since you seldom (hopefully!) read the backup files.
  
 But in that situation, speed is not a major issue.
  
  mvh.,
  
  David
  
   From what I could read, content based storage is supposed to be in-line
  
  deduplication, there are already plans to do (probably) a userland daemon
  traversing the FS and merging indentical extents -- giving you
  post-process deduplication.
  
  For a rather heavy used host (such as a VM host) you'd probably want to
  use post-process dedup -- as the daemon can be easly stopped or be given
  lower priority. In line dedup is quite CPU intensive.
  
  In line dedup is very nice for backup though -- you don't need the
  temporary storage before the (mostly unchanged) data is deduplicated.
 
 I think post-process deduplication is the way to go here, using a
 userspace daemon.  It's the most flexible solution.  As you say, inline
 dedup could be nice in some cases, such as for backups, since the cpu
 time cost is not an issue there.  However, in a typical backup
 situation, the new files are often written fairly slowly (for remote
 backups).  Even for local backups, there is generally not that much
 /new/ data, since you normally use some sort of incremental backup
 scheme (such as rsync, combined with cp -al or cp --reflink).  Thus it
 should be fine to copy over the data, then de-dup it later or in the
 background.
 


Re: [patch] btrfs: use add_to_page_cache_lru, use __page_cache_alloc

2010-03-17 Thread Nick Piggin
On Wed, Mar 17, 2010 at 05:20:53PM +1100, Nick Piggin wrote:
 btrfs: use add_to_page_cache_lru, use __page_cache_alloc
 
 Pagecache pages should be allocated with __page_cache_alloc, so they
 obey pagecache memory policies.
 
 add_to_page_cache_lru is exported, so it should be used. Benefits over
 using a private pagevec: neater code, 128 bytes fewer stack used, percpu
 lru ordering is preserved, and finally don't need to flush pagevec
 before returning so batching may be shared with other LRU insertions.
 
 Signed-off-by: Nick Piggin <npig...@suse.de>

Missed a rediff.
---
 fs/btrfs/compression.c |   20 ++--
 fs/btrfs/extent_io.c   |   22 +-
 2 files changed, 7 insertions(+), 35 deletions(-)

Index: linux-2.6/fs/btrfs/compression.c
===================================================================
--- linux-2.6.orig/fs/btrfs/compression.c
+++ linux-2.6/fs/btrfs/compression.c
@@ -31,7 +31,6 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/bit_spinlock.h>
-#include <linux/pagevec.h>
 #include "compat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -445,7 +444,6 @@ static noinline int add_ra_bio_pages(str
 	unsigned long nr_pages = 0;
 	struct extent_map *em;
 	struct address_space *mapping = inode->i_mapping;
-	struct pagevec pvec;
 	struct extent_map_tree *em_tree;
 	struct extent_io_tree *tree;
 	u64 end;
@@ -461,7 +459,6 @@ static noinline int add_ra_bio_pages(str
 
 	end_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;
 
-	pagevec_init(&pvec, 0);
 	while (last_offset < compressed_end) {
 		page_index = last_offset >> PAGE_CACHE_SHIFT;
 
@@ -478,26 +475,17 @@ static noinline int add_ra_bio_pages(str
 			goto next;
 		}
 
-		page = alloc_page(mapping_gfp_mask(mapping) & ~__GFP_FS);
+		page = __page_cache_alloc(mapping_gfp_mask(mapping) &
+								~__GFP_FS);
 		if (!page)
 			break;
 
-		page->index = page_index;
-		/*
-		 * what we want to do here is call add_to_page_cache_lru,
-		 * but that isn't exported, so we reproduce it here
-		 */
-		if (add_to_page_cache(page, mapping,
-				page->index, GFP_NOFS)) {
+		if (add_to_page_cache_lru(page, mapping, page_index,
+								GFP_NOFS)) {
 			page_cache_release(page);
 			goto next;
 		}
 
-		/* open coding of lru_cache_add, also not exported */
-		page_cache_get(page);
-		if (!pagevec_add(&pvec, page))
-			__pagevec_lru_add_file(&pvec);
-
 		end = last_offset + PAGE_CACHE_SIZE - 1;
 		/*
 		 * at this point, we have a locked page in the page cache
@@ -551,8 +539,6 @@ static noinline int add_ra_bio_pages(str
 next:
 		last_offset += PAGE_CACHE_SIZE;
 	}
-	if (pagevec_count(&pvec))
-		__pagevec_lru_add_file(&pvec);
 	return 0;
 }
 
Index: linux-2.6/fs/btrfs/extent_io.c
===================================================================
--- linux-2.6.orig/fs/btrfs/extent_io.c
+++ linux-2.6/fs/btrfs/extent_io.c
@@ -2663,33 +2663,21 @@ int extent_readpages(struct extent_io_tr
 {
 	struct bio *bio = NULL;
 	unsigned page_idx;
-	struct pagevec pvec;
 	unsigned long bio_flags = 0;
 
-	pagevec_init(&pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		/*
-		 * what we want to do here is call add_to_page_cache_lru,
-		 * but that isn't exported, so we reproduce it here
-		 */
-		if (!add_to_page_cache(page, mapping,
+		if (add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
-
-			/* open coding of lru_cache_add, also not exported */
-			page_cache_get(page);
-			if (!pagevec_add(&pvec, page))
-				__pagevec_lru_add_file(&pvec);
-			__extent_read_full_page(tree, page, get_extent,
-						&bio, 0, &bio_flags);
+			page_cache_release(page);
+			continue;
 		}
-		page_cache_release(page);
+		__extent_read_full_page(tree, page, get_extent,
+					&bio, 0, &bio_flags);
 	}
-	if (pagevec_count(&pvec))
-		__pagevec_lru_add_file(&pvec);
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		submit_one_bio(READ, bio, 0, bio_flags);

Re: Content based storage

2010-03-17 Thread Hubert Kario
On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
 Hi,
 
 just want to add one correction to your thoughts:
 
 Storage is not cheap if you think about enterprise storage on a SAN,
 replicated to another data centre. Using dedup on the storage boxes leads
  to performance issues and other problems - only NetApp is offering this at
  the moment and it's not heavily used (because of the issues).

there are at least two other suppliers with inline dedup products, and there 
is an OSS solution: lessfs

 So I think it would be a big advantage for professional use to have dedup
 build into the filesystem - processors are faster and faster today and not
  the cost drivers any more. I do not think it's a problem to spend on
  core of a 2 socket box with 12 cores for this purpose.
 Storage is cost intensive:
 - SAN boxes are expensive
 - RAID5 in two locations is expensive
 - FC lines between locations is expensive (depeding very much on where you
 are).

In-line dedup is expensive in two ways: first you have to cache the data going 
to disk and generate a checksum for it, then you have to look up whether such a 
block is already stored -- if the database doesn't fit into RAM (for a VM host 
that's more than likely) that requires at least a few disk seeks, if not a few 
dozen for really big databases. Then you should read the block/extent back and 
compare it bit for bit. And only then write the data to the disk. That reduces 
your IOPS by at least an order of magnitude, if not more.
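
Purely to make that write path visible (an in-memory toy sketch, not anything 
btrfs or any shipping product actually does; the fixed-size arrays and the 
FNV-1a checksum are illustrative assumptions): every incoming block is 
checksummed, looked up in an index, verified byte for byte against the stored 
copy, and only written out when it really is new -- all of that work sits in 
front of the actual write.

/*
 * Toy model of an in-line dedup write path:
 * checksum -> index lookup -> byte-for-byte verify -> write or reference.
 * A real implementation does the lookup and the verify against disk,
 * which is where the extra seeks and the IOPS penalty come from.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE	4096
#define MAX_BLOCKS	1024

static unsigned char store[MAX_BLOCKS][BLOCK_SIZE];
static uint64_t sums[MAX_BLOCKS];
static int nr_blocks;

static uint64_t fnv1a(const unsigned char *p, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	while (len--)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

/* Returns the block index the caller should reference. */
static int dedup_write(const unsigned char *block)
{
	uint64_t sum = fnv1a(block, BLOCK_SIZE);
	int i;

	for (i = 0; i < nr_blocks; i++) {
		if (sums[i] != sum)
			continue;
		/* checksum matched: verify before trusting it */
		if (!memcmp(store[i], block, BLOCK_SIZE))
			return i;	/* duplicate: store a reference only */
	}
	if (nr_blocks == MAX_BLOCKS)
		return -1;		/* toy store is full */
	/* genuinely new data: pay for the write */
	memcpy(store[nr_blocks], block, BLOCK_SIZE);
	sums[nr_blocks] = sum;
	return nr_blocks++;
}

int main(void)
{
	unsigned char a[BLOCK_SIZE], b[BLOCK_SIZE];

	memset(a, 'A', sizeof(a));
	memset(b, 'B', sizeof(b));

	printf("write A -> block %d\n", dedup_write(a));
	printf("write B -> block %d\n", dedup_write(b));
	printf("write A again -> block %d (deduplicated)\n", dedup_write(a));
	return 0;
}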

For post-process dedup you can go as fast as your HDDs will allow you. And 
then, when your machine is mostly idle you can go and churn through the data.

IMHO in-line dedup is a good thing only as storage for backups -- where you 
have a high probability that the stored data is duplicated (and with a 1:10 
dedup ratio, you have a 90% probability that it is).

So the CPU cost is only one factor. HDDs are a major bottleneck too.

All things considered, it would be best to have both post-process and in-line 
data deduplication, but I think that in-line dedup will see much less use.

 
 Naturally, you would not use this feature for all kind of use cases (eg.
 heavily used database), but I think there is enough need.
 
 my 2 cents,
 Heinz-Josef Claes
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with ISO 9001:2000


Re: Content based storage

2010-03-17 Thread Leszek Ciesielski
On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario h...@qbs.com.pl wrote:
 On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
 Hi,

 just want to add one correction to your thoughts:

 Storage is not cheap if you think about enterprise storage on a SAN,
 replicated to another data centre. Using dedup on the storage boxes leads
  to performance issues and other problems - only NetApp is offering this at
  the moment and it's not heavily used (because of the issues).

 there are at least two other suppliers with inline dedup products and there is
 OSS solution: lessfs

 So I think it would be a big advantage for professional use to have dedup
 build into the filesystem - processors are faster and faster today and not
  the cost drivers any more. I do not think it's a problem to spend on
  core of a 2 socket box with 12 cores for this purpose.
 Storage is cost intensive:
 - SAN boxes are expensive
 - RAID5 in two locations is expensive
 - FC lines between locations is expensive (depeding very much on where you
 are).

 In-line dedup is expensive in two ways: first you have to cache the data going
 to disk and generate checksum for it, then you have to look if such block is
 already stored -- if the database doesn't fit into RAM (for a VM host it's 
 more
 than likely) it requires at least few disk seeks, if not a few dozen for
 really big databases. Then you should read the block/extent back and compare
 them bit for bit. And only then write the data to the disk. That reduces your
 IOPS by at least an order of maginitude, if not more.

Sun decided that with SHA256 (which ZFS uses for normal checksumming)
collisions are unlikely enough to skip the read/compare step:
http://blogs.sun.com/bonwick/entry/zfs_dedup . That's not the case, of
course, with the CRC32 used by btrfs, but a switch to a stronger hash would
be recommended to reduce collisions anyway. And yes, for the truly
paranoid, a forced verification (after the hashes match) is always an
option.
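
For a back-of-the-envelope sense of scale (an added estimate, not a figure 
from the thread): with 2^35 distinct 128 KiB extents -- about 4 PiB of unique 
data -- the birthday bound puts the chance of any two SHA-256 checksums 
colliding at no more than (2^35)^2 / 2^257 = 2^-187, which is many orders of 
magnitude below the undetected error rates of the disks and RAM underneath.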


 For post-process dedup you can go as fast as your HDDs will allow you. And
 then, when your machine is mostly idle you can go and churn through the data.

 IMHO in-line dedup is a good thing only as storage for backups -- when you
 have high probability that the stored data is duplicated (and with a 1:10
 dedup ratio you have 90% probability, it is).

 So the CPU cost is only one factor. HDDs are a major bottleneck too.

 All things considered, it would be best to have both post-process and in-line
 data deduplication, but I think, that in-line dedup will see much less use.


 Naturally, you would not use this feature for all kind of use cases (eg.
 heavily used database), but I think there is enough need.

 my 2 cents,
 Heinz-Josef Claes
 --
 Hubert Kario
 QBS - Quality Business Software
 02-656 Warszawa, ul. Ksawerów 30/85
 tel. +48 (22) 646-61-51, 646-74-24
 www.qbs.com.pl

 System Zarządzania Jakością
 zgodny z normą ISO 9001:2000



extent map merge bad block_len

2010-03-17 Thread jim owens
Chris,

Something that probably should be fixed is how
merging extent maps with block_len == -1 produces
illegal lengths, as in 8191.

I saw it with holes in directIO and it is not the
cause of my current problems so I'll hope someone
else decides to fix.
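
(One possible explanation for that particular value, offered only as a guess:
block_len == -1 is really (u64)-1, so any further length added to it during a
merge wraps around -- (u64)-1 + 8192, for instance, comes out to exactly 8191.)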

jim


Re: Content based storage

2010-03-17 Thread Hubert Kario
On Wednesday 17 March 2010 16:33:41 Leszek Ciesielski wrote:
 On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario h...@qbs.com.pl wrote:
  On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
  Hi,
 
  just want to add one correction to your thoughts:
 
  Storage is not cheap if you think about enterprise storage on a SAN,
  replicated to another data centre. Using dedup on the storage boxes
  leads to performance issues and other problems - only NetApp is offering
  this at the moment and it's not heavily used (because of the issues).
 
  there are at least two other suppliers with inline dedup products and
  there is OSS solution: lessfs
 
  So I think it would be a big advantage for professional use to have
  dedup build into the filesystem - processors are faster and faster today
  and not the cost drivers any more. I do not think it's a problem to
  spend on core of a 2 socket box with 12 cores for this purpose.
  Storage is cost intensive:
  - SAN boxes are expensive
  - RAID5 in two locations is expensive
  - FC lines between locations is expensive (depeding very much on where
  you are).
 
  In-line dedup is expensive in two ways: first you have to cache the data
  going to disk and generate checksum for it, then you have to look if such
  block is already stored -- if the database doesn't fit into RAM (for a VM
  host it's more than likely) it requires at least few disk seeks, if not a
  few dozen for really big databases. Then you should read the block/extent
  back and compare them bit for bit. And only then write the data to the
  disk. That reduces your IOPS by at least an order of maginitude, if not
  more.
 
 Sun decided that with SHA256 (which ZFS uses for normal checksumming)
 collisions are unlikely enough to skip the read/compare step:
 http://blogs.sun.com/bonwick/entry/zfs_dedup . That's not the case, of
 course, with btrfs-used CRC32, but a switch to a stronger hash would
 be recommended to reduce collisions anyway. And yes, for the truly
 paranoid, a forced verification (after the hashes match) is always an
 option.
 

If the server contains financial data I'd prefer "impossible", not 
"unlikely".

Read further: Sun did provide a way to enable the compare step by using 
"verify" instead of "on":
zfs set dedup=verify pool

And, yes, I know that the probability of hardware malfunction is vastly higher 
than the probability of collision (that's why I wrote "should"; next time I'll 
write it as SHOULD as per RFC 2119 ;). But, as history has shown, all hash 
algorithms get broken eventually, the question is only when. If the FS does 
verify the data, then an attacker can't use collisions to get at data they 
shouldn't have access to.
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with ISO 9001:2000


btrfs: why default 4M readahead size?

2010-03-17 Thread Shaohua Li
Btrfs uses the equation below to calculate ra_pages:
	fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
				    4 * 1024 * 1024 / PAGE_CACHE_SIZE);
Is the max() a typo for min()? This makes the readahead size 4M by default,
which is too big.
I have a system with 16 CPU, 6G memory and 12 sata disks. I create a btrfs for
each disk, so this isn't a raid setup. The test is fio, which has 12 tasks to
access 12 files for each disk. The fio test is mmap sequential read. I measure
the performance with different readahead sizes:
ra size		io throughput
4M		268288 k/s
2M		367616 k/s
1M		431104 k/s
512K		474112 k/s
256K		512000 k/s
128K		538624 k/s
The 4M default readahead size has poor performance.
I also did a sync sequential read test; there the difference isn't that big, but
the 4M case still has about a 10% drop compared to the 512K case.

One might ask about the case where memory isn't tight. I tried a one-disk
setup with only one task; the 4M ra makes almost no difference compared to the
128K ra. I guess the 128K default ra size for a backing dev is carefully chosen
to work with popular disks.
So my question is: why do we have a default 4M readahead size even in the
non-RAID case?
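
(To spell out the arithmetic behind the question, assuming 4K pages and the
stock 128K backing-dev default of 32 pages: 4 * 1024 * 1024 / PAGE_CACHE_SIZE
is 1024 pages, so max(32, 1024) always forces ra_pages up to 4M, while
min(32, 1024) would leave the 128K default alone and only act as a 4M cap on
devices that ask for more.)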

Thanks,
Shaohua


I have a list of 47,000 pharmaceutical companies in the US

2010-03-17 Thread slack Madrid


Email me at this address for a catalog of all our US lists: 
evangelina.bar...@lowestpricelists.co.cc

Also, ask about our sale pricing for more than one list.  
  




Send us an email to rem...@lowestpricelists.co.cc we will discontinue from the 
list