Re: Fractal Tree Indexing over B-Trees?

2013-07-04 Thread Kẏra
Niels de Carpentier niels at decarpentier.com writes:
  I'd like to see how they do that.  The fact is you are still going to get
  random
  seeks since you have to binary search the blocks in an entire row since
  there is
  no way you can read a several thousand block row into memory to search it,
  so
  once your rows get pretty big you are doing just as much if not more
  random io
  as a btree.
 
 Why? The rows are sequential on disk, so you're never doing more random
 seeks than the number of rows as long as you can search faster than the
 disk can provide data. If they indeed can limit the number of searches per
 row to a constant, it shouldn't be an issue at all.
 
  I don't buy that its better in the insert case either for similar reasons,
  you
  are talking about having to rewrite entire rows to maintain the sequential
  nature of everything, so if you end up adding a giant row you have to go
  and
  write the whole thing out again, so you are talking about a lot more IO
  than
  with b-trees, albeit sequential.  So maybe it's a win since it's
  sequential but
  I wonder at what tree size that benefit no longer exists.
 
 The worst case might be bad, but on average it should be good. I don't
 doubt it is always better on rotating disks, probably not on ssd's.
 
  Of course all this analysis is based on a power point presentation and
  there may
  be some detail I'm missing, but that's mostly my point, claiming that
  b-trees
  are now obsolete because we can maybe do inserts faster is just idiotic.
 
 It's a presentation from a company, so lots of marketing. Of course it
 doesn't make btrees obsolete, it's optimized for specific cases. Like they
 say they reduce random seeks at the cost of disk bandwidth and cpu usage.
 It depends on the use case if that is useful or not. Probably not for
 btrfs, since the future will be ssd's, and meta data will generally be
 cached.
 
 Niels
 

I'm reviving a very old thread from here:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/16484

I found a bunch of more recent info from the company doing most of the work
behind fractal tree FS that I'd love to hear your thoughts on.

https://www.youtube.com/watch?v=c-n2LGPpQEw
http://www.tokutek.com/2013/02/mongodb-fractal-tree-indexes-high-compression
https://www.usenix.org/conference/hotstorage12/tokufs-streaming-file-system

Does this address earlier concerns at all? 

I just hope their work will be
released as unpatented GPL-compatible free software. 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread C Anthony Risinger
On Wed, Mar 28, 2012 at 9:25 AM, Danny Piccirillo
danny.picciri...@member.fsf.org wrote:
 The case has been made on Phoronix for F-Trees: They makes use hard
 drive speeds, not (relatively slow) access times; beat SSD's; and scale
 perfectly across multiple cores with hundreds of millions of entries.

 http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes

 How TokuDB Fractal Tree Databases Work

 Via: http://www.phoronix.com/scan.php?page=news_itempx=MTA3NjM

 Time for someone to get started on ftrfs? Or can it be implemented
 in Btrfs?
 https://bugzilla.kernel.org/show_bug.cgi?id=43004

whoa, very cool stuff.  fractals are awesome, cool to see them in use.

... 2010/11, surprised i never heard of it before now.  thanks for the
reference/links at the very least!

aside: i once described fractals to my grandmother (100% devout
catholic) as related to my own understanding of the universe --
specifically, i pointed out how their often simple mathematical
identity fragments into an infinitely self-similar pattern of
seemingly unbounded complexity -- she told me i was describing god ...
and, well, we agreed :-)

-- 

C Anthony
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
 The case has been made on Phoronix for F-Trees: They makes use hard
 drive speeds, not (relatively slow) access times; beat SSD's; and scale 
 perfectly across multiple cores with hundreds of millions of entries.
 
 http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
 
 How TokuDB Fractal Tree Databases Work
 
 Via: http://www.phoronix.com/scan.php?page=news_itempx=MTA3NjM
 
 Time for someone to get started on ftrfs? Or can it be implemented 
 in Btrfs? 
 https://bugzilla.kernel.org/show_bug.cgi?id=43004
 

This is dumber shit than usual out of phoronix.  I did some just basic
calculations for worst case scenarios, let's say we max out btrfs's 8 level
limit, so we have 7 levels of nodes and then our leaves.  Lets just for
simplicity sake say we can fit 1 billion objects into this kind of tree, and for
each node/leaf we can only hold 10 objects.  (Keep in mind these are just random
numbers I'm pulling out of my ass and may not work out right with such a large
tree, just work with me here).

So worst case scenario doing a binary search for an object with 10 objects is 4
comparisons, so with a full 8 level btree we're doing 4 comparisons per level,
so 32 comparisons worst possible case to find our object.

Now let's consider a fractal tree with 1 billion objects.  So that would have a
29 row fractal tree (if I did my math right).  Now I have to make some
assumptions about how you would implement a fractal tree here, but let's assume
that every level tells you it's min and max value to make it easier to tell
which level you need to search in.  So worst case you're object is in the 29th
level, so you have to read 29 blocks in and check the min/max levels to figure
out which row to search.  Let's be fair and say these are in memory, we're still
doing 29 comparisons right out of the gate to find the right row.  Now we have
the right row, but this is a file system and the rows are split up into blocks,
so we not only have to binary search each block, but the blocks themselves to
find the right block with our object.  And we can't do this directed either
because we have no way of indexing the individual blocks, so worst case scenario
we're having to read in 268435456 blocks to then do a normal binary search on
_each_ block!  Now lets again say just to give fractal trees an added benefit
that each block has it's own min/max setting, we only have to binary search the
one that will have our actual data, but we're still talking about reading in
33554432 times the number of blocks as compared to a btree!!

And this doesn't even begin to touch the ability to handle multi-threaded
workloads.  Imagine having to move all of these blocks around in memory because
your insert created a new row.  You'd have to lock each row as you move stuff
around (at the very least), so if you hit a particularly hot row your
multithreaded performance is going to look like you are running on an Atari
2600.

So no I don't think it's time for someone to get started on ftrfs.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 03/28/2012 02:42 PM, Josef Bacik wrote:
 On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
 The case has been made on Phoronix for F-Trees: They makes use
 hard drive speeds, not (relatively slow) access times; beat
 SSD's; and scale perfectly across multiple cores with hundreds of
 millions of entries.

 So no I don't think it's time for someone to get started on ftrfs.
 Thanks,

Thanks for doing the research to back up the answer I was going to
respond with based on intuition.

+1

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJPc1xgAAoJEB57S2MheeWydJIP/R7jHiloaPLfwffuFS3gobmy
h0DqX8YezGMs70kLBYTXeu3EVuFtX8IFGJDGoVp3pkM9nS6F6iK9LbRWDC7IDAy7
t1taQoUqy0HMmmrkXfZ8KWmXWv424/Zq5aHfnegL/oq4OrUwnHtjzJBbsx4fvezW
mnPrNlOxaLWgXSyUs+1hDjvcgfmnRjpgURHfsfGNgX3gTUE4xNKCFXD4zs0bs5YO
NcpLa/66R5UwINPLOazHt9QpC2jHxPb0j93YZBipwtbtXPj1W+od/0rOgPvc9gaK
hzi7kRiegt50V2LVOJnofdaBC0AlCfRlcXRUNOil1vgFe6s8sft1KOdl8MShQ/+t
JUTjb1j7Z3efds42nnahvNj8wV4T6z6qLlhSQiOulYbVarBsfrWoM3EJY2ObzIlM
uOjCIPwHr/fzMJ9wVyLgK2ksssZmYAVgQ3YyS9G1CHIPe9xgW9Ld570mhhXRoLRj
HG5QZkfkmFzlde+S/gGgrdrvTWDT1zDsUhe+IGYu3jtPjXVs1udB+BN3c7rTWjf+
gQUOdcS/6XHoTXWpgZ7p5DUb8qF7Nv+ABBcEWeiaIEPH7uUZ22/XZEdf2euSCTp2
+TRZfoj9PZ17cszkllG4n+kc5H4gdKMYRyvNQo9mQ2TvogsmVOgIY+0Fys2ZdOkF
CaZ6ti9ZLkofawzImOV5
=UaEh
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 02:42:04PM -0400, Josef Bacik wrote:
 On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
  The case has been made on Phoronix for F-Trees: They makes use hard
  drive speeds, not (relatively slow) access times; beat SSD's; and scale 
  perfectly across multiple cores with hundreds of millions of entries.
  
  http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
  
  How TokuDB Fractal Tree Databases Work
  
  Via: http://www.phoronix.com/scan.php?page=news_itempx=MTA3NjM
  
  Time for someone to get started on ftrfs? Or can it be implemented 
  in Btrfs? 
  https://bugzilla.kernel.org/show_bug.cgi?id=43004
  
 
 This is dumber shit than usual out of phoronix.  I did some just basic
 calculations for worst case scenarios, let's say we max out btrfs's 8 level
 limit, so we have 7 levels of nodes and then our leaves.  Lets just for
 simplicity sake say we can fit 1 billion objects into this kind of tree, and 
 for
 each node/leaf we can only hold 10 objects.  (Keep in mind these are just 
 random
 numbers I'm pulling out of my ass and may not work out right with such a large
 tree, just work with me here).
 
 So worst case scenario doing a binary search for an object with 10 objects is 
 4
 comparisons, so with a full 8 level btree we're doing 4 comparisons per level,
 so 32 comparisons worst possible case to find our object.
 
 Now let's consider a fractal tree with 1 billion objects.  So that would have 
 a
 29 row fractal tree (if I did my math right).  Now I have to make some
 assumptions about how you would implement a fractal tree here, but let's 
 assume
 that every level tells you it's min and max value to make it easier to tell
 which level you need to search in.  So worst case you're object is in the 29th
 level, so you have to read 29 blocks in and check the min/max levels to figure
 out which row to search.  Let's be fair and say these are in memory, we're 
 still
 doing 29 comparisons right out of the gate to find the right row.  Now we have
 the right row, but this is a file system and the rows are split up into 
 blocks,
 so we not only have to binary search each block, but the blocks themselves to
 find the right block with our object.  And we can't do this directed either
 because we have no way of indexing the individual blocks, so worst case 
 scenario
 we're having to read in 268435456 blocks to then do a normal binary search on
 _each_ block!  Now lets again say just to give fractal trees an added benefit
 that each block has it's own min/max setting, we only have to binary search 
 the
 one that will have our actual data, but we're still talking about reading in
 33554432 times the number of blocks as compared to a btree!!
 

Ok looks like I wasn't being completely fair here, part of the presentation they
show talks about using forward pointers to point to where the element would be
in the next row.  So if we assume we're using forward pointers and every row has
a min/max indicator you can walk down each row and do a binary search to get as
close as possible to what you want and then follow the forward pointer down to
the next level.  The problem with this is that in the worst case you end up
having to binary search on each row, which makes the SMP case even crappier
because you have to make sure nothing changes while you are walking down the
rows and following the forward pointers.  The math here is a little trickier
since I can't easily picture what the worst case scenario would be per level,
but lets say O(log N/2) where N is the number of elements in the row.  So in the
situation I describe you are looking at having to do minimum of 29 reads, one
for each row, and then add into that all the reads you need to binary search
from your forward pointer on to the end of the row, which is going to increase
as you go down the tree.  So probably not millions times more work than b-trees,
but at least 10s if not 100s of times more work in the worst case.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Zach Brown



but lets say O(log N/2) where N is the number of elements in the row.
So in the situation I describe you are looking at having to do minimum
of 29 reads, one for each row,


Hmm.

Levels are powers of two and are either full or empty.  So the total
item count tells you which levels are full or empty.

(gdb) print/t 10
$3 = 1110111001101011001010

So no, definitely not reading 29 levels.

After realizing this, I have to be honest: I'm not finding your
hand-wavey dismissal of the technology convincing at all. :)

There's a *ton* of detail that they don't describe about how to actually
translate the logical notion of power-of-two node sizes into coherent
block IO that scales.  I don't doubt that it's possible.

I very much doubt that the required complexity and cpu cost in the
kernel would be worth it for generic file system users, though.  Hooo
boy, do I doubt it.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 03:50:07PM -0400, Zach Brown wrote:
 
 but lets say O(log N/2) where N is the number of elements in the row.
 So in the situation I describe you are looking at having to do minimum
 of 29 reads, one for each row,
 
 Hmm.
 
 Levels are powers of two and are either full or empty.  So the total
 item count tells you which levels are full or empty.
 
 (gdb) print/t 10
 $3 = 1110111001101011001010
 
 So no, definitely not reading 29 levels.


You are still going to have to have at least 29 levels to accomodate 1 billion
objects, though they won't all be full (sorry I missed the must be full or empty
bit).  So it looks like we'll have to actually search what 13 rows right?  So
still more rows than a b-tree, and again you are talking about binary searching
each row's blocks and then following its forward pointer down to the next one,
so maybe not 100's of times slower than a b-tree, but we're not talking about a
5-10% difference either, probably still measured in multiples.

 After realizing this, I have to be honest: I'm not finding your
 hand-wavey dismissal of the technology convincing at all. :)
 

Hah my math was a little off I'll grant you but the basic idea still stands,
once we start getting into larger and larger rows we're going to have to binary
search just to find the right _block_ to use, which in my mind is much more of a
problem than having to binary search inside of multiple blocks.  At the very
least as the number of objects grows the time it takes to find something in the
tree increases exponentially.

 There's a *ton* of detail that they don't describe about how to actually
 translate the logical notion of power-of-two node sizes into coherent
 block IO that scales.  I don't doubt that it's possible.

I imagine there is, but based on what little information they've shown I don't
see how it's a hands down win against b-trees.  If anything we're talking about
having to solve really complex problems in order to get any sort of good
performance out of this thing.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Zach Brown



I imagine there is, but based on what little information they've shown
I don't see how it's a hands down win against b-trees.  If anything
we're talking about having to solve really complex problems in order
to get any sort of good performance out of this thing.


Oh, absolutely.  Tack on COW and online repair and the complexity goes
through the roof.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Niels de Carpentier


 You are still going to have to have at least 29 levels to accomodate 1
 billion
 objects, though they won't all be full (sorry I missed the must be full or
 empty
 bit).  So it looks like we'll have to actually search what 13 rows right?
 So
 still more rows than a b-tree, and again you are talking about binary
 searching
 each row's blocks and then following its forward pointer down to the next
 one,
 so maybe not 100's of times slower than a b-tree, but we're not talking
 about a
 5-10% difference either, probably still measured in multiples.

They say that they can do the block pointers in a way that limits the
number for searches per row to a constant, so it's not that bad. They do
less random seeks, but bigger ones at the cost of more cpu usage.

 Hah my math was a little off I'll grant you but the basic idea still
 stands,
 once we start getting into larger and larger rows we're going to have to
 binary
 search just to find the right _block_ to use, which in my mind is much
 more of a
 problem than having to binary search inside of multiple blocks.  At the
 very
 least as the number of objects grows the time it takes to find something
 in the
 tree increases exponentially.

That's not what they say. A lot of info is missing though.

 There's a *ton* of detail that they don't describe about how to actually
 translate the logical notion of power-of-two node sizes into coherent
 block IO that scales.  I don't doubt that it's possible.

 I imagine there is, but based on what little information they've shown I
 don't
 see how it's a hands down win against b-trees.  If anything we're talking
 about
 having to solve really complex problems in order to get any sort of good
 performance out of this thing.  Thanks,

They don't claim to win hands down for the search case, they say they are
similar for the search case, but much better at inserts.

I don't think it's good for the linux fs general use case, but it's not as
bad as you think. But it will be very hard to implement anyway unless they
release more details.

Niels

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 10:44:03PM +0200, Niels de Carpentier wrote:
 
 
  You are still going to have to have at least 29 levels to accomodate 1
  billion
  objects, though they won't all be full (sorry I missed the must be full or
  empty
  bit).  So it looks like we'll have to actually search what 13 rows right?
  So
  still more rows than a b-tree, and again you are talking about binary
  searching
  each row's blocks and then following its forward pointer down to the next
  one,
  so maybe not 100's of times slower than a b-tree, but we're not talking
  about a
  5-10% difference either, probably still measured in multiples.
 
 They say that they can do the block pointers in a way that limits the
 number for searches per row to a constant, so it's not that bad. They do
 less random seeks, but bigger ones at the cost of more cpu usage.

I'd like to see how they do that.  The fact is you are still going to get random
seeks since you have to binary search the blocks in an entire row since there is
no way you can read a several thousand block row into memory to search it, so
once your rows get pretty big you are doing just as much if not more random io
as a btree.

 
  Hah my math was a little off I'll grant you but the basic idea still
  stands,
  once we start getting into larger and larger rows we're going to have to
  binary
  search just to find the right _block_ to use, which in my mind is much
  more of a
  problem than having to binary search inside of multiple blocks.  At the
  very
  least as the number of objects grows the time it takes to find something
  in the
  tree increases exponentially.
 
 That's not what they say. A lot of info is missing though.
 
  There's a *ton* of detail that they don't describe about how to actually
  translate the logical notion of power-of-two node sizes into coherent
  block IO that scales.  I don't doubt that it's possible.
 
  I imagine there is, but based on what little information they've shown I
  don't
  see how it's a hands down win against b-trees.  If anything we're talking
  about
  having to solve really complex problems in order to get any sort of good
  performance out of this thing.  Thanks,
 
 They don't claim to win hands down for the search case, they say they are
 similar for the search case, but much better at inserts.
 
 I don't think it's good for the linux fs general use case, but it's not as
 bad as you think. But it will be very hard to implement anyway unless they
 release more details.


I don't buy that its better in the insert case either for similar reasons, you
are talking about having to rewrite entire rows to maintain the sequential
nature of everything, so if you end up adding a giant row you have to go and
write the whole thing out again, so you are talking about a lot more IO than
with b-trees, albeit sequential.  So maybe it's a win since it's sequential but
I wonder at what tree size that benefit no longer exists.

Of course all this analysis is based on a power point presentation and there may
be some detail I'm missing, but that's mostly my point, claiming that b-trees
are now obsolete because we can maybe do inserts faster is just idiotic.
Thanks,

Josef 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Niels de Carpentier
 I'd like to see how they do that.  The fact is you are still going to get
 random
 seeks since you have to binary search the blocks in an entire row since
 there is
 no way you can read a several thousand block row into memory to search it,
 so
 once your rows get pretty big you are doing just as much if not more
 random io
 as a btree.

Why? The rows are sequential on disk, so you're never doing more random
seeks than the number of rows as long as you can search faster than the
disk can provide data. If they indeed can limit the number of searches per
row to a constant, it shouldn't be an issue at all.

 I don't buy that its better in the insert case either for similar reasons,
 you
 are talking about having to rewrite entire rows to maintain the sequential
 nature of everything, so if you end up adding a giant row you have to go
 and
 write the whole thing out again, so you are talking about a lot more IO
 than
 with b-trees, albeit sequential.  So maybe it's a win since it's
 sequential but
 I wonder at what tree size that benefit no longer exists.

The worst case might be bad, but on average it should be good. I don't
doubt it is always better on rotating disks, probably not on ssd's.

 Of course all this analysis is based on a power point presentation and
 there may
 be some detail I'm missing, but that's mostly my point, claiming that
 b-trees
 are now obsolete because we can maybe do inserts faster is just idiotic.

It's a presentation from a company, so lots of marketing. Of course it
doesn't make btrees obsolete, it's optimized for specific cases. Like they
say they reduce random seeks at the cost of disk bandwidth and cpu usage.
It depends on the use case if that is useful or not. Probably not for
btrfs, since the future will be ssd's, and meta data will generally be
cached.

Niels

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html