Re: [openstack-dev] [swift] Optimizing storage for small objects in Swift

2017-06-19 Thread Clint Byrum
Excerpts from Alexandre Lécuyer's message of 2017-06-19 11:36:15 +0200:
> Hello Clint,
> 
> Thanks for your feedback, replying in the email inline.
> 
> On 06/16/2017 10:54 PM, Clint Byrum wrote:
> > Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:
> >> On 16 Jun 2017, at 10:51, Clint Byrum wrote:
> >>
> >>> This is great work.
> >>>
> >>> I'm sure you've already thought of this, but could you explain why
> >>> you've chosen not to put the small objects in the k/v store as part of
> >>> the value rather than in secondary large files?
> >> I don't want to co-opt an answer from Alex, but I do want to point to some 
> >> of the other background on this LOSF work.
> >>
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation
> >>
> > These are great. Thanks for sharing them, I understand a lot more now.
> >
> >> Look at the second link for some context to your answer, but the summary 
> >> is "that means writing a file system, and writing a file system is really 
> >> hard".
> >>
> > I'm not sure we were thinking the same thing.
> >
> > I was more asking, why not put the content of the object into the k/v
> > instead of the big_file_id:offset? My thinking was that for smaller
> > objects, you would just return the data immediately upon reading the k/v,
> > rather than then needing to go find the big file and read the offset.
> > However, I'm painfully aware that those directly involved with the problem
> > have likely thought of this. However, the experiments don't seem to show
> > that this was attempted. Perhaps I'm zooming too far out to see the real
> > problem space. You can all tell me to take my spray paint can and stop
> > staring at the bike shed if this is just too annoying. Seriously.
> >
> > Of course, one important thing is, what does one consider "small"? Seems
> > like there's a size where the memory footprint of storing it in the
> > k/v would be justifiable if reads just returned immediately from k/v
> > vs. needing to also go get data from a big file on disk. Perhaps that
> > size is too low to really matter. I was hoping that this had been
> > considered and there was documentation, but I don't really see it.
> Right, we had considered this when we started the project : storing 
> small objects directly in the KV. It would not be too diffcult to do, 
> but we see a few problems :
> 
> 1) consistency
> In the current design, we append data at the end of a "big file". When 
> the data upload is finished, swift writes the metadata and commits the 
> file. This triggers a fsync(). Only then do we return. We can rely on 
> the data being stable on disk, even if there is a power loss.  Because 
> we fallocate() space for the "big files" beforehand, we can also hope to 
> have mostly sequential disk IO.
> (Important as most swift clusters use SATA disks).
> 
> Once the object has been committed, we create an entry for it in the KV. 
> This is done asynchronously, because synchronous writes on the KV kills 
> performance. If we loose power, we loose the latest data. After the 
> server is rebooted, we have to scan the end of volumes to create missing 
> entries in the KV. (I will not discuss this in detail in this email to 
> keep this short, but we can discuss it in another thread, or I can post 
> some information on the wiki).
> 
> If we put small objects in the KV, we would need to do synchronous 
> writes to make sure we don't loose data.
> Also, currently we can completly reconstruct the KV from the "big 
> files". It would not be possible anymore.
> 
> 
> 2) performance
> On our clusters we see about 40% of physical disk IO being caused by 
> readdir().
> We want to serve directory listing requests from memory. So "small" 
> means "the KV can fit in the page cache".
> We estimate that we need the size per object to be below 50 bytes, which 
> doesn't leave much room for data.
> 
> LevelDB causes write amplification, as it will regularly copy data to 
> different files (levels) to keep keys compressed and in sorted order. If 
> we store object data within the KV, it will be copied around multiple 
> times as well.
> 
> 
> Finally it is also more simple to have only one path to handle. Beyond 
> these issues, it would not be difficult to store data in the KV. This is 
> something we can revisit after more test and maybe some production 
> experience.
> 

Really great explanation. Thanks for sharing. I hope we can all learn
from the thorough approach you've taken to this problem. Good luck!

> >
> > Also the "writing your own filesystem" option in experiments seemed
> > more like a thing to do if you left the k/v stores out entirely.
> 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: 

Re: [openstack-dev] [swift] Optimizing storage for small objects in Swift

2017-06-19 Thread Alexandre Lécuyer

Hello John,

Thanks for your comments! Replying inline

On 06/16/2017 07:06 PM, John Dickinson wrote:

Alex, this is fantastic work and great info. Thanks for sharing it.

Additional comments inline.

On 16 Jun 2017, at 6:54, Alexandre Lécuyer wrote:


Swift stores objects on a regular filesystem (XFS is recommended), one file per 
object. While it works fine for medium or big objects, when you have lots of 
small objects you can run into issues: because of the high count of inodes on 
the object servers, they can’t stay in cache, implying lot of memory usage and 
IO operations to fetch inodes from disk.

In the past few months, we’ve been working on implementing a new storage backend 
in Swift. It is highly inspired by haystack[1]. In a few words, objects are stored 
in big files, and a Key/Value store provides information to locate an object 
(object hash -> big_file_id:offset). As the mapping in the K/V consumes less 
memory than an inode, it is possible to keep all entries in memory, saving many IO 
to locate the object. It also allows some performance improvements by limiting the 
XFS meta updates (e.g.: almost no inode updates as we write objects by using 
fdatasync() instead of fsync())

One of the questions that was raised during discussions about this design is: 
do we want one K/V store per device, or one K/V store per Swift partition (= 
multiple K/V per device). The concern was about failure domain. If the only K/V 
gets corrupted, the whole device must be reconstructed. Memory usage is a major 
point in making a decision, so we did some benchmark.

The key-value store is implemented over LevelDB.
Given a single disk with 20 million files (could be either one object replica 
or one fragment, if using EC)

I have tested three cases :
   - single KV for the whole disk
   - one KV per partition, with 100 partitions per disk
   - one KV per partition, with 1000 partitions per disk

Single KV for the disk :
   - DB size: 750 MB
   - bytes per object: 38

One KV per partition :
Assuming :
   - 100 partitions on the disk (=> 100 KV)
   - 16 bits part power (=> all keys in a given KV will have the same 16 bit 
prefix)

   - 7916 KB per KV, total DB size: 773 MB
   - bytes per object: 41

One KV per partition :
Assuming :
   - 1000 partitions on the disk (=> 1000 KV)
   - 16 bits part power (=> all keys in a given KV will have the same 16 bit 
prefix)

   - 1388 KB per KV, total DB size: 1355 MB total
   - bytes per object: 71

A typical server we use for swift clusters has 36 drives, which gives us :
- Single KV : 26 GB
- Split KV, 100 partitions : 28 GB (+7%)
- Split KV, 1000 partitions : 48 GB (+85%)

So, splitting seems reasonable if you don't have too many partitions.

Same test, with 10 million files instead of 20

- Single KV : 13 GB
- Split KV, 100 partitions : 18 GB (+38%)
- Split KV, 1000 partitions : 24 GB (+85%)


Finally, if we run a full compaction on the DB after the test, you get the
same memory usage in all cases, about 32 bytes per object.

We have not made enough tests to know what would happen in production. LevelDB
does trigger compaction automatically on parts of the DB, but continuous change
means we probably would not reach the smallest possible size.

This is likely a very good assumption (that the KV will continuously change and 
never get to minimum size).

My initial instinct is to go for one KV per drive.

One per partition does sound nice, but it is more sensitive to proper cluster 
configuration and deployment. For example, if an operator were to deploy a 
relatively small cluster but have a part power that's too big for the capacity, 
the KV strategy would end up with many thousands of mostly-empty partitions 
(imagine a 5-node cluster, 60 drives with a part power of 18 -- you're looking 
at more than 13k parts per drive per storage policy). Going for one KV per 
whole drive means that poor ring settings won't impact this area of storage as 
much.
That is also what we think. We will do more testing to confirm that one 
K/V per disk is stable with many objects under load, and if it does not 
corrupt when power outages occur. (we will have to recover a little 
data, but not rebuild the whole K/V).







Beyond the size issue, there are other things to consider :
File descriptors limits : LevelDB seems to keep at least 4 file descriptors 
open during operation.

Having one KV per partition also means you have to move entries between KVs 
when you change the part power. (if we want to support that)

Yes, let's support that (in general)! But doing on KV per drive means it 
already works for this LOSF work.


A compromise may be to split KVs on a small prefix of the object's hash, 
independent of swift's configuration.

This is an interesting idea to explore. It will allow for smaller individual KV 
stores without being as sensitive to the ring parameters.


As you can see we're still thinking about this. Any ideas are welcome !
We will keep you updated about more "real 

Re: [openstack-dev] [swift] Optimizing storage for small objects in Swift

2017-06-19 Thread Alexandre Lécuyer

Hello Clint,

Thanks for your feedback, replying in the email inline.

On 06/16/2017 10:54 PM, Clint Byrum wrote:

Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:

On 16 Jun 2017, at 10:51, Clint Byrum wrote:


This is great work.

I'm sure you've already thought of this, but could you explain why
you've chosen not to put the small objects in the k/v store as part of
the value rather than in secondary large files?

I don't want to co-opt an answer from Alex, but I do want to point to some of 
the other background on this LOSF work.

https://wiki.openstack.org/wiki/Swift/ideas/small_files
https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation


These are great. Thanks for sharing them, I understand a lot more now.


Look at the second link for some context to your answer, but the summary is "that 
means writing a file system, and writing a file system is really hard".


I'm not sure we were thinking the same thing.

I was more asking, why not put the content of the object into the k/v
instead of the big_file_id:offset? My thinking was that for smaller
objects, you would just return the data immediately upon reading the k/v,
rather than then needing to go find the big file and read the offset.
However, I'm painfully aware that those directly involved with the problem
have likely thought of this. However, the experiments don't seem to show
that this was attempted. Perhaps I'm zooming too far out to see the real
problem space. You can all tell me to take my spray paint can and stop
staring at the bike shed if this is just too annoying. Seriously.

Of course, one important thing is, what does one consider "small"? Seems
like there's a size where the memory footprint of storing it in the
k/v would be justifiable if reads just returned immediately from k/v
vs. needing to also go get data from a big file on disk. Perhaps that
size is too low to really matter. I was hoping that this had been
considered and there was documentation, but I don't really see it.
Right, we had considered this when we started the project : storing 
small objects directly in the KV. It would not be too diffcult to do, 
but we see a few problems :


1) consistency
In the current design, we append data at the end of a "big file". When 
the data upload is finished, swift writes the metadata and commits the 
file. This triggers a fsync(). Only then do we return. We can rely on 
the data being stable on disk, even if there is a power loss.  Because 
we fallocate() space for the "big files" beforehand, we can also hope to 
have mostly sequential disk IO.

(Important as most swift clusters use SATA disks).

Once the object has been committed, we create an entry for it in the KV. 
This is done asynchronously, because synchronous writes on the KV kills 
performance. If we loose power, we loose the latest data. After the 
server is rebooted, we have to scan the end of volumes to create missing 
entries in the KV. (I will not discuss this in detail in this email to 
keep this short, but we can discuss it in another thread, or I can post 
some information on the wiki).


If we put small objects in the KV, we would need to do synchronous 
writes to make sure we don't loose data.
Also, currently we can completly reconstruct the KV from the "big 
files". It would not be possible anymore.



2) performance
On our clusters we see about 40% of physical disk IO being caused by 
readdir().
We want to serve directory listing requests from memory. So "small" 
means "the KV can fit in the page cache".
We estimate that we need the size per object to be below 50 bytes, which 
doesn't leave much room for data.


LevelDB causes write amplification, as it will regularly copy data to 
different files (levels) to keep keys compressed and in sorted order. If 
we store object data within the KV, it will be copied around multiple 
times as well.



Finally it is also more simple to have only one path to handle. Beyond 
these issues, it would not be difficult to store data in the KV. This is 
something we can revisit after more test and maybe some production 
experience.




Also the "writing your own filesystem" option in experiments seemed
more like a thing to do if you left the k/v stores out entirely.






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [swift] Optimizing storage for small objects in Swift

2017-06-16 Thread Clint Byrum
Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:
> 
> On 16 Jun 2017, at 10:51, Clint Byrum wrote:
> 
> > This is great work.
> >
> > I'm sure you've already thought of this, but could you explain why
> > you've chosen not to put the small objects in the k/v store as part of
> > the value rather than in secondary large files?
> 
> I don't want to co-opt an answer from Alex, but I do want to point to some of 
> the other background on this LOSF work.
> 
> https://wiki.openstack.org/wiki/Swift/ideas/small_files
> https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
> https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation
> 

These are great. Thanks for sharing them, I understand a lot more now.

> Look at the second link for some context to your answer, but the summary is 
> "that means writing a file system, and writing a file system is really hard".
> 

I'm not sure we were thinking the same thing.

I was more asking, why not put the content of the object into the k/v
instead of the big_file_id:offset? My thinking was that for smaller
objects, you would just return the data immediately upon reading the k/v,
rather than then needing to go find the big file and read the offset.
However, I'm painfully aware that those directly involved with the problem
have likely thought of this. However, the experiments don't seem to show
that this was attempted. Perhaps I'm zooming too far out to see the real
problem space. You can all tell me to take my spray paint can and stop
staring at the bike shed if this is just too annoying. Seriously.

Of course, one important thing is, what does one consider "small"? Seems
like there's a size where the memory footprint of storing it in the
k/v would be justifiable if reads just returned immediately from k/v
vs. needing to also go get data from a big file on disk. Perhaps that
size is too low to really matter. I was hoping that this had been
considered and there was documentation, but I don't really see it.

Also the "writing your own filesystem" option in experiments seemed
more like a thing to do if you left the k/v stores out entirely.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [swift] Optimizing storage for small objects in Swift

2017-06-16 Thread John Dickinson


On 16 Jun 2017, at 10:51, Clint Byrum wrote:

> This is great work.
>
> I'm sure you've already thought of this, but could you explain why
> you've chosen not to put the small objects in the k/v store as part of
> the value rather than in secondary large files?

I don't want to co-opt an answer from Alex, but I do want to point to some of 
the other background on this LOSF work.

https://wiki.openstack.org/wiki/Swift/ideas/small_files
https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation

Look at the second link for some context to your answer, but the summary is 
"that means writing a file system, and writing a file system is really hard".

--John



>
> Excerpts from Alexandre Lécuyer's message of 2017-06-16 15:54:08 +0200:
>> Swift stores objects on a regular filesystem (XFS is recommended), one file 
>> per object. While it works fine for medium or big objects, when you have 
>> lots of small objects you can run into issues: because of the high count of 
>> inodes on the object servers, they can’t stay in cache, implying lot of 
>> memory usage and IO operations to fetch inodes from disk.
>>
>> In the past few months, we’ve been working on implementing a new storage 
>> backend in Swift. It is highly inspired by haystack[1]. In a few words, 
>> objects are stored in big files, and a Key/Value store provides information 
>> to locate an object (object hash -> big_file_id:offset). As the mapping in 
>> the K/V consumes less memory than an inode, it is possible to keep all 
>> entries in memory, saving many IO to locate the object. It also allows some 
>> performance improvements by limiting the XFS meta updates (e.g.: almost no 
>> inode updates as we write objects by using fdatasync() instead of fsync())
>>
>> One of the questions that was raised during discussions about this design 
>> is: do we want one K/V store per device, or one K/V store per Swift 
>> partition (= multiple K/V per device). The concern was about failure domain. 
>> If the only K/V gets corrupted, the whole device must be reconstructed. 
>> Memory usage is a major point in making a decision, so we did some benchmark.
>>
>> The key-value store is implemented over LevelDB.
>> Given a single disk with 20 million files (could be either one object 
>> replica or one fragment, if using EC)
>>
>> I have tested three cases :
>>- single KV for the whole disk
>>- one KV per partition, with 100 partitions per disk
>>- one KV per partition, with 1000 partitions per disk
>>
>> Single KV for the disk :
>>- DB size: 750 MB
>>- bytes per object: 38
>>
>> One KV per partition :
>> Assuming :
>>- 100 partitions on the disk (=> 100 KV)
>>- 16 bits part power (=> all keys in a given KV will have the same 16 bit 
>> prefix)
>>
>>- 7916 KB per KV, total DB size: 773 MB
>>- bytes per object: 41
>>
>> One KV per partition :
>> Assuming :
>>- 1000 partitions on the disk (=> 1000 KV)
>>- 16 bits part power (=> all keys in a given KV will have the same 16 bit 
>> prefix)
>>
>>- 1388 KB per KV, total DB size: 1355 MB total
>>- bytes per object: 71
>>
>>
>> A typical server we use for swift clusters has 36 drives, which gives us :
>> - Single KV : 26 GB
>> - Split KV, 100 partitions : 28 GB (+7%)
>> - Split KV, 1000 partitions : 48 GB (+85%)
>>
>> So, splitting seems reasonable if you don't have too many partitions.
>>
>> Same test, with 10 million files instead of 20
>>
>> - Single KV : 13 GB
>> - Split KV, 100 partitions : 18 GB (+38%)
>> - Split KV, 1000 partitions : 24 GB (+85%)
>>
>>
>> Finally, if we run a full compaction on the DB after the test, you get the
>> same memory usage in all cases, about 32 bytes per object.
>>
>> We have not made enough tests to know what would happen in production. 
>> LevelDB
>> does trigger compaction automatically on parts of the DB, but continuous 
>> change
>> means we probably would not reach the smallest possible size.
>>
>>
>> Beyond the size issue, there are other things to consider :
>> File descriptors limits : LevelDB seems to keep at least 4 file descriptors 
>> open during operation.
>>
>> Having one KV per partition also means you have to move entries between KVs 
>> when you change the part power. (if we want to support that)
>>
>> A compromise may be to split KVs on a small prefix of the object's hash, 
>> independent of swift's configuration.
>>
>> As you can see we're still thinking about this. Any ideas are welcome !
>> We will keep you updated about more "real world" testing. Among the tests we 
>> plan to check how resilient the DB is in case of a power loss.
>>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


signature.asc

Re: [openstack-dev] [swift] Optimizing storage for small objects in Swift

2017-06-16 Thread Clint Byrum
This is great work.

I'm sure you've already thought of this, but could you explain why
you've chosen not to put the small objects in the k/v store as part of
the value rather than in secondary large files?

Excerpts from Alexandre Lécuyer's message of 2017-06-16 15:54:08 +0200:
> Swift stores objects on a regular filesystem (XFS is recommended), one file 
> per object. While it works fine for medium or big objects, when you have lots 
> of small objects you can run into issues: because of the high count of inodes 
> on the object servers, they can’t stay in cache, implying lot of memory usage 
> and IO operations to fetch inodes from disk.
> 
> In the past few months, we’ve been working on implementing a new storage 
> backend in Swift. It is highly inspired by haystack[1]. In a few words, 
> objects are stored in big files, and a Key/Value store provides information 
> to locate an object (object hash -> big_file_id:offset). As the mapping in 
> the K/V consumes less memory than an inode, it is possible to keep all 
> entries in memory, saving many IO to locate the object. It also allows some 
> performance improvements by limiting the XFS meta updates (e.g.: almost no 
> inode updates as we write objects by using fdatasync() instead of fsync())
> 
> One of the questions that was raised during discussions about this design is: 
> do we want one K/V store per device, or one K/V store per Swift partition (= 
> multiple K/V per device). The concern was about failure domain. If the only 
> K/V gets corrupted, the whole device must be reconstructed. Memory usage is a 
> major point in making a decision, so we did some benchmark.
> 
> The key-value store is implemented over LevelDB.
> Given a single disk with 20 million files (could be either one object replica 
> or one fragment, if using EC)
> 
> I have tested three cases :
>- single KV for the whole disk
>- one KV per partition, with 100 partitions per disk
>- one KV per partition, with 1000 partitions per disk
> 
> Single KV for the disk :
>- DB size: 750 MB
>- bytes per object: 38
> 
> One KV per partition :
> Assuming :
>- 100 partitions on the disk (=> 100 KV)
>- 16 bits part power (=> all keys in a given KV will have the same 16 bit 
> prefix)
> 
>- 7916 KB per KV, total DB size: 773 MB
>- bytes per object: 41
> 
> One KV per partition :
> Assuming :
>- 1000 partitions on the disk (=> 1000 KV)
>- 16 bits part power (=> all keys in a given KV will have the same 16 bit 
> prefix)
> 
>- 1388 KB per KV, total DB size: 1355 MB total
>- bytes per object: 71
>
> 
> A typical server we use for swift clusters has 36 drives, which gives us :
> - Single KV : 26 GB
> - Split KV, 100 partitions : 28 GB (+7%)
> - Split KV, 1000 partitions : 48 GB (+85%)
> 
> So, splitting seems reasonable if you don't have too many partitions.
> 
> Same test, with 10 million files instead of 20
> 
> - Single KV : 13 GB
> - Split KV, 100 partitions : 18 GB (+38%)
> - Split KV, 1000 partitions : 24 GB (+85%)
> 
> 
> Finally, if we run a full compaction on the DB after the test, you get the
> same memory usage in all cases, about 32 bytes per object.
> 
> We have not made enough tests to know what would happen in production. LevelDB
> does trigger compaction automatically on parts of the DB, but continuous 
> change
> means we probably would not reach the smallest possible size.
> 
> 
> Beyond the size issue, there are other things to consider :
> File descriptors limits : LevelDB seems to keep at least 4 file descriptors 
> open during operation.
> 
> Having one KV per partition also means you have to move entries between KVs 
> when you change the part power. (if we want to support that)
> 
> A compromise may be to split KVs on a small prefix of the object's hash, 
> independent of swift's configuration.
> 
> As you can see we're still thinking about this. Any ideas are welcome !
> We will keep you updated about more "real world" testing. Among the tests we 
> plan to check how resilient the DB is in case of a power loss.
> 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [swift] Optimizing storage for small objects in Swift

2017-06-16 Thread John Dickinson
Alex, this is fantastic work and great info. Thanks for sharing it.

Additional comments inline.

On 16 Jun 2017, at 6:54, Alexandre Lécuyer wrote:

> Swift stores objects on a regular filesystem (XFS is recommended), one file 
> per object. While it works fine for medium or big objects, when you have lots 
> of small objects you can run into issues: because of the high count of inodes 
> on the object servers, they can’t stay in cache, implying lot of memory usage 
> and IO operations to fetch inodes from disk.
>
> In the past few months, we’ve been working on implementing a new storage 
> backend in Swift. It is highly inspired by haystack[1]. In a few words, 
> objects are stored in big files, and a Key/Value store provides information 
> to locate an object (object hash -> big_file_id:offset). As the mapping in 
> the K/V consumes less memory than an inode, it is possible to keep all 
> entries in memory, saving many IO to locate the object. It also allows some 
> performance improvements by limiting the XFS meta updates (e.g.: almost no 
> inode updates as we write objects by using fdatasync() instead of fsync())
>
> One of the questions that was raised during discussions about this design is: 
> do we want one K/V store per device, or one K/V store per Swift partition (= 
> multiple K/V per device). The concern was about failure domain. If the only 
> K/V gets corrupted, the whole device must be reconstructed. Memory usage is a 
> major point in making a decision, so we did some benchmark.
>
> The key-value store is implemented over LevelDB.
> Given a single disk with 20 million files (could be either one object replica 
> or one fragment, if using EC)
>
> I have tested three cases :
>   - single KV for the whole disk
>   - one KV per partition, with 100 partitions per disk
>   - one KV per partition, with 1000 partitions per disk
>
> Single KV for the disk :
>   - DB size: 750 MB
>   - bytes per object: 38
>
> One KV per partition :
> Assuming :
>   - 100 partitions on the disk (=> 100 KV)
>   - 16 bits part power (=> all keys in a given KV will have the same 16 bit 
> prefix)
>
>   - 7916 KB per KV, total DB size: 773 MB
>   - bytes per object: 41
>
> One KV per partition :
> Assuming :
>   - 1000 partitions on the disk (=> 1000 KV)
>   - 16 bits part power (=> all keys in a given KV will have the same 16 bit 
> prefix)
>
>   - 1388 KB per KV, total DB size: 1355 MB total
>   - bytes per object: 71
>
> A typical server we use for swift clusters has 36 drives, which gives us :
> - Single KV : 26 GB
> - Split KV, 100 partitions : 28 GB (+7%)
> - Split KV, 1000 partitions : 48 GB (+85%)
>
> So, splitting seems reasonable if you don't have too many partitions.
>
> Same test, with 10 million files instead of 20
>
> - Single KV : 13 GB
> - Split KV, 100 partitions : 18 GB (+38%)
> - Split KV, 1000 partitions : 24 GB (+85%)
>
>
> Finally, if we run a full compaction on the DB after the test, you get the
> same memory usage in all cases, about 32 bytes per object.
>
> We have not made enough tests to know what would happen in production. LevelDB
> does trigger compaction automatically on parts of the DB, but continuous 
> change
> means we probably would not reach the smallest possible size.

This is likely a very good assumption (that the KV will continuously change and 
never get to minimum size).

My initial instinct is to go for one KV per drive.

One per partition does sound nice, but it is more sensitive to proper cluster 
configuration and deployment. For example, if an operator were to deploy a 
relatively small cluster but have a part power that's too big for the capacity, 
the KV strategy would end up with many thousands of mostly-empty partitions 
(imagine a 5-node cluster, 60 drives with a part power of 18 -- you're looking 
at more than 13k parts per drive per storage policy). Going for one KV per 
whole drive means that poor ring settings won't impact this area of storage as 
much.

>
>
> Beyond the size issue, there are other things to consider :
> File descriptors limits : LevelDB seems to keep at least 4 file descriptors 
> open during operation.
>
> Having one KV per partition also means you have to move entries between KVs 
> when you change the part power. (if we want to support that)

Yes, let's support that (in general)! But doing on KV per drive means it 
already works for this LOSF work.

>
> A compromise may be to split KVs on a small prefix of the object's hash, 
> independent of swift's configuration.

This is an interesting idea to explore. It will allow for smaller individual KV 
stores without being as sensitive to the ring parameters.

>
> As you can see we're still thinking about this. Any ideas are welcome !
> We will keep you updated about more "real world" testing. Among the tests we 
> plan to check how resilient the DB is in case of a power loss.

I'd also be very interested in other tests around concurrent access to the KV 
store. If we've only got one per whole