Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread David Greaves

Paul Clements wrote:
Well, if people would like to see a timeout option, I actually coded up 
a patch a couple of years ago to do just that, but I never got it into 
mainline because you can do almost as well by doing a check at 
user-level (I basically ping the nbd connection periodically and if it 
fails, I kill -9 the nbd-client).



Yes please.

David



Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread David Greaves

[EMAIL PROTECTED] wrote:
per the message below MD (or DM) would need to be modified to work 
reasonably well with one of the disk components being over an unreliable 
link (like a network link)


are the MD/DM maintainers interested in extending their code in this 
direction? or would they prefer to keep it simpler by being able to 
continue to assume that the raid components are connected over a highly 
reliable connection?


if they are interested in adding (and maintaining) this functionality 
then there is a real possibility that NBD+MD/DM could eliminate the need 
for DRBD. However, if they are not interested in adding all the code to 
deal with the network-type issues, then the argument that DRBD should 
not be merged because you can do the same thing with MD/DM + NBD is 
invalid and can be dropped/ignored.


David Lang


As a user I'd like to see md/nbd extended to cope with unreliable links.
I think md could be better at handling link exceptions. My unreliable memory 
recalls sporadic issues with hot-plug leaving md hanging and certain lower level 
errors (or even very high latency) causing unsatisfactory behaviour in what is 
supposed to be a fault 'tolerant' subsystem.



Would this just be relevant to network devices, or would it improve support for 
jostled USB and SATA hot-plugging, I wonder?


David



Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread david

On Mon, 13 Aug 2007, David Greaves wrote:


[EMAIL PROTECTED] wrote:

 per the message below MD (or DM) would need to be modified to work
 reasonably well with one of the disk components being over an unreliable
 link (like a network link)

 are the MD/DM maintainers interested in extending their code in this
 direction? or would they prefer to keep it simpler by being able to
 continue to assume that the raid components are connected over a highly
 reliable connection?

 if they are interested in adding (and maintaining) this functionality then
 there is a real possibility that NBD+MD/DM could eliminate the need for
 DRBD. However, if they are not interested in adding all the code to deal
 with the network-type issues, then the argument that DRBD should not be
 merged because you can do the same thing with MD/DM + NBD is invalid and
 can be dropped/ignored.

 David Lang


As a user I'd like to see md/nbd extended to cope with unreliable links.
I think md could be better at handling link exceptions. My unreliable memory 
recalls sporadic issues with hot-plug leaving md hanging and certain lower 
level errors (or even very high latency) causing unsatisfactory behaviour in 
what is supposed to be a fault 'tolerant' subsystem.



Would this just be relevant to network devices, or would it improve support 
for jostled USB and SATA hot-plugging, I wonder?


Good question. I suspect that some of the error handling would be similar 
(for devices that are unreachable, not hanging the system, for example), but 
a lot of the rest would be different (do you really want to try to 
auto-resync to a drive that you _think_ just reappeared? what if it's a 
different drive? how can you be sure?). The error rate of a network is going 
to be significantly higher than for USB or SATA drives (although I suppose 
iSCSI would be similar).


David Lang


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread David Greaves

[EMAIL PROTECTED] wrote:
Would this just be relevant to network devices, or would it improve 
support for jostled USB and SATA hot-plugging, I wonder?


good question, I suspect that some of the error handling would be 
similar (for devices that are unreachable not hanging the system for 
example), but a lot of the rest would be different (do you really want 
to try to auto-resync to a drive that you _think_ just reappeared,
Well, omit 'think' and the answer may be yes. A lot of systems are quite 
simple and RAID is common on the desktop now. If jostled USB fits into this 
category - then yes.


what 
if it's a different drive? how can you be sure?
And that's the key, isn't it? We have the RAID device UUID and the superblock 
info. Isn't that enough? If not, then given the work involved, an extended 
superblock wouldn't be unreasonable.
And I suspect the capabilities of devices would need recording in the superblock 
too, e.g. 'retry-on-fail'.
I can see how md would fail a device but might then periodically retry it. If a 
retry shows that it's back, it would validate it (UUID) and then resync it.
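As a rough illustration of that retry-and-revalidate idea (this is not an
existing md feature), a userspace loop could wait for the failed component to
reappear, compare the Array UUID recorded in its md superblock against the
array it belongs to, and only then re-add it; the device names and UUID below
are made-up examples.

#!/usr/bin/env python
# Hypothetical userspace retry loop for a failed raid1 component: when the
# device node reappears, check its md superblock UUID before re-adding it.
# ARRAY, COMPONENT and EXPECTED_UUID are example values, not real ones.
import os
import re
import subprocess
import time

ARRAY = "/dev/md0"
COMPONENT = "/dev/nbd0"
EXPECTED_UUID = "00000000:00000000:00000000:00000000"  # placeholder

def component_uuid(dev):
    """Read the Array UUID from the device's md superblock (v1.x metadata)."""
    try:
        out = subprocess.run(["mdadm", "--examine", dev],
                             capture_output=True, text=True, check=True).stdout
    except subprocess.CalledProcessError:
        return None     # no readable superblock yet
    m = re.search(r"Array UUID\s*:\s*(\S+)", out)
    return m.group(1) if m else None

while True:
    if os.path.exists(COMPONENT) and component_uuid(COMPONENT) == EXPECTED_UUID:
        # Same array as before: re-add it and let md resync (only the blocks
        # changed while it was away, if a write-intent bitmap is configured).
        subprocess.run(["mdadm", ARRAY, "--re-add", COMPONENT])
        break
    time.sleep(60)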


) the error rate of a 
network is going to be significantly higher than for USB or SATA drives 
(although I suppose iSCSI would be similar)


I do agree - I was looking for value-add for the existing subsystem. If this 
benefits existing RAID users then it's more likely to be attractive.


David


[RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Al Boldi
Lars Ellenberg wrote:
 meanwhile, please, anyone interested,
 the drbd paper for LinuxConf Eu 2007 is finalized.
 http://www.drbd.org/fileadmin/drbd/publications/
 drbd8.linux-conf.eu.2007.pdf

 it does not give too much implementation detail (would be inappropriate
 for conference proceedings, imo; some paper commenting on the source
 code should follow).

 but it does give a good overview about what DRBD actually is,
 what exact problems it tries to solve,
 and what developments to expect in the near future.

 so you can make up your mind about
  Do we need it?, and
  Why DRBD? Why not NBD + MD-RAID?

Ok, conceptually your driver sounds really interesting, but when I read the 
pdf I got completely turned off.  The problem is that the concepts are not 
clearly implemented, when in fact the concepts are really simple:

  Allow shared access to remote block storage with fault tolerance.

The first thing to tackle here would be write serialization.  Then start 
thinking about fault tolerance.

Now, shared remote block access should theoretically be handled, as does 
DRBD, by a block layer driver, but realistically it may be more appropriate 
to let it be handled by the combining end user, like OCFS or GFS.

The idea here is to simplify lower layer implementations while removing any 
preconceived dependencies, and let upper layers have free rein without incurring 
redundant overhead.

Look at ZFS; it illegally violates layering by combining md/dm/lvm with the 
fs, but it does this based on a realistic understanding of the problems 
involved, which enables it to improve performance, flexibility, and 
functionality specific to its use case.

This implies that there are two distinct forces at work here:

  1. Layer components
  2. Use-Case composers

Layer components should technically not implement any use case (other than 
providing a plumbing framework), as that would incur unnecessary 
dependencies, which could reduce its generality and thus reusability.

Use-Case composers can now leverage layer components from across the layering 
hierarchy, to yield a specific use case implementation.

DRBD is such a Use-Case composer, as are md / dm / lvm and any fs in general, 
whereas aoe / nbd / loop and the VFS / FUSE are examples of layer 
components.

It follows that Use-Case composers, like DRBD, need common functionality that 
should be factored out into layer components, which can then be recomposed to 
implement a specific use case.


Thanks!

--
Al



Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Jan Engelhardt

On Aug 12 2007 13:35, Al Boldi wrote:
Lars Ellenberg wrote:
 meanwhile, please, anyone interested,
 the drbd paper for LinuxConf Eu 2007 is finalized.
 http://www.drbd.org/fileadmin/drbd/publications/
 drbd8.linux-conf.eu.2007.pdf

 but it does give a good overview about what DRBD actually is,
 what exact problems it tries to solve,
 and what developments to expect in the near future.

 so you can make up your mind about
  Do we need it?, and
  Why DRBD? Why not NBD + MD-RAID?

I may have made a mistake when asking for how it compares to NBD+MD.
Let me retry: what's the functional difference between
GFS2 on DRBD vs. GFS2 on a DAS SAN?

Now, shared remote block access should theoretically be handled, as does 
DRBD, by a block layer driver, but realistically it may be more appropriate 
to let it be handled by the combining end user, like OCFS or GFS.


Jan
-- 


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Evgeniy Polyakov
On Sun, Aug 12, 2007 at 01:35:17PM +0300, Al Boldi ([EMAIL PROTECTED]) wrote:
 Lars Ellenberg wrote:
  meanwhile, please, anyone interested,
  the drbd paper for LinuxConf Eu 2007 is finalized.
  http://www.drbd.org/fileadmin/drbd/publications/
  drbd8.linux-conf.eu.2007.pdf
 
  it does not give too much implementation detail (would be inappropriate
  for conference proceedings, imo; some paper commenting on the source
  code should follow).
 
  but it does give a good overview about what DRBD actually is,
  what exact problems it tries to solve,
  and what developments to expect in the near future.
 
  so you can make up your mind about
   Do we need it?, and
   Why DRBD? Why not NBD + MD-RAID?
 
 Ok, conceptually your driver sounds really interesting, but when I read the 
 pdf I got completely turned off.  The problem is that the concepts are not 
 clearly implemented, when in fact the concepts are really simple:
 
   Allow shared access to remote block storage with fault tolerance.
 
 The first thing to tackle here would be write serialization.  Then start 
 thinking about fault tolerance.
 
 Now, shared remote block access should theoretically be handled, as does 
 DRBD, by a block layer driver, but realistically it may be more appropriate 
 to let it be handled by the combining end user, like OCFS or GFS.
 
 The idea here is to simplify lower layer implementations while removing any 
 preconceived dependencies, and let upper layers reign free without incurring 
 redundant overhead.
 
 Look at ZFS; it illegally violates layering by combining md/dm/lvm with the 
 fs, but it does this based on a realistic understanding of the problems 
 involved, which enables it to improve performance, flexibility, and 
 functionality specific to its use case.
 
 This implies that there are two distinct forces at work here:
 
   1. Layer components
   2. Use-Case composers
 
 Layer components should technically not implement any use case (other than 
 providing a plumbing framework), as that would incur unnecessary 
 dependencies, which could reduce its generality and thus reusability.
 
 Use-Case composers can now leverage layer components from across the layering 
 hierarchy, to yield a specific use case implementation.
 
 DRBD is such a Use-Case composer, as are md / dm / lvm and any fs in general, 
 whereas aoe / nbd / loop and the VFS / FUSE are examples of layer 
 components.
 
 It follows that Use-case composers, like DRBD, need common functionality that 
 should be factored out into layer components, and then recompose to 
 implement a specific use case.

Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
on top of distributed storage (which is a surprise to me, that holy zfs
supports that)?
 
 Thanks!
 
 --
 Al
 

-- 
Evgeniy Polyakov


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Al Boldi
Evgeniy Polyakov wrote:
 Al Boldi ([EMAIL PROTECTED]) wrote:
  Look at ZFS; it illegally violates layering by combining md/dm/lvm with
  the fs, but it does this based on a realistic understanding of the
  problems involved, which enables it to improve performance, flexibility,
  and functionality specific to its use case.
 
  This implies that there are two distinct forces at work here:
 
1. Layer components
2. Use-Case composers
 
  Layer components should technically not implement any use case (other
  than providing a plumbing framework), as that would incur unnecessary
  dependencies, which could reduce its generality and thus reusability.
 
  Use-Case composers can now leverage layer components from across the
  layering hierarchy, to yield a specific use case implementation.
 
  DRBD is such a Use-Case composer, as are md / dm / lvm and any fs in
  general, whereas aoe / nbd / loop and the VFS / FUSE are examples of
  layer components.
 
  It follows that Use-case composers, like DRBD, need common functionality
  that should be factored out into layer components, and then recompose to
  implement a specific use case.

 Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
 on top of distributed storage (which is a surprise to me, that holy zfs
 supports that)?

Actually, I may not have been very clear in my Use-Case composer description: 
I mean an internal in-kernel Use-Case composer as opposed to an external 
userland Use-Case composer.

So, nbd+dm+raid1 would be an external Userland Use-Case composition, which 
obviously could have some drastic performance issues.

DRBD and ZFS are examples of internal in-kernel Use-Case composers, which 
obviously could show some drastic performance improvements.  

Although you could allow in-kernel Use-Case composers to be run on top of 
Userland Use-Case composers, that wouldn't be the preferred mode of 
operation.  Instead, you would for example recompose ZFS to incorporate an 
in-kernel distributed storage layer component, like nbd.

All this boils down to refactoring Use-Case composers to produce layer 
components with both in-kernel and userland interfaces.  Once we have that, 
it becomes a matter of plug-and-play to produce something awesome like ZFS.


Thanks!

--
Al



Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david

On Sun, 12 Aug 2007, Jan Engelhardt wrote:


On Aug 12 2007 13:35, Al Boldi wrote:

Lars Ellenberg wrote:

meanwhile, please, anyone interested,
the drbd paper for LinuxConf Eu 2007 is finalized.
http://www.drbd.org/fileadmin/drbd/publications/
drbd8.linux-conf.eu.2007.pdf

but it does give a good overview about what DRBD actually is,
what exact problems it tries to solve,
and what developments to expect in the near future.

so you can make up your mind about
 Do we need it?, and
 Why DRBD? Why not NBD + MD-RAID?


I may have made a mistake when asking for how it compares to NBD+MD.
Let me retry: what's the functional difference between
GFS2 on DRBD vs. GFS2 on a DAS SAN?


GFS is a distributed filesystem, DRBD is a replicated block device. You 
wouldn't do GFS on top of DRBD; you would do ext2/3, XFS, etc.


DRBD is much closer to the NBD+MD option.

now, I am not an expert on either option, but there are a couple of things 
that I would question about the DRBD+MD option


1. when the remote machine is down, how does MD deal with it for reads and 
writes?


2. MD over a local drive will alternate reads between mirrors (or so I've 
been told); doing so over the network is wrong.


3. when writing, will MD wait for the network I/O to get the data saved on 
the backup before returning from the syscall? or can it sync the data out 
lazily?



Now, shared remote block access should theoretically be handled, as does
DRBD, by a block layer driver, but realistically it may be more appropriate
to let it be handled by the combining end user, like OCFS or GFS.


there are times when you want to replicate at the block layer, and there 
are times when you want to have a filesystem do the work. Don't force a 
filesystem on use cases where a block device is the right answer.


David Lang


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Jan Engelhardt

On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:

 now, I am not an expert on either option, but there are a couple of things that I
 would question about the DRBD+MD option

 1. when the remote machine is down, how does MD deal with it for reads and
 writes?

I suppose it kicks the drive and you'd have to re-add it by hand unless done by
a cronjob.

 2. MD over local drive will alternate reads between mirrors (or so I've been
 told), doing so over the network is wrong.

Certainly. In which case you set write_mostly (or even write_only, not sure
of its name) on the raid component that is nbd.

 3. when writing, will MD wait for the network I/O to get the data saved on the
 backup before returning from the syscall? or can it sync the data out lazily

Can't answer this one - ask Neil :)




Jan
-- 


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Iustin Pop
On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
 
 On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
 
  now, I am not an expert on either option, but there are a couple of things
  that I would question about the DRBD+MD option
 
  1. when the remote machine is down, how does MD deal with it for reads and
  writes?
 
 I suppose it kicks the drive and you'd have to re-add it by hand unless done
 by a cronjob.

From my tests, since NBD doesn't have a timeout option, MD hangs in the
write to that mirror indefinitely, somewhat like when dealing with a
broken IDE driver/chipset/disk.

  2. MD over local drive will alternate reads between mirrors (or so I've been
  told), doing so over the network is wrong.
 
 Certainly. In which case you set write_mostly (or even write_only, not sure
 of its name) on the raid component that is nbd.
 
  3. when writing, will MD wait for the network I/O to get the data saved on
  the backup before returning from the syscall? or can it sync the data out
  lazily?
 
 Can't answer this one - ask Neil :)

MD has the write-mostly/write-behind options, which help in this case
but only up to a point.


In my experience DRBD wins hands-down over MD+NBD, because MD doesn't
know about (or handle) a component that never returns from a write, which is
quite different from returning with an error. Furthermore, DRBD was
designed to handle transient errors in the connection to the peer due to
its network-oriented design, whereas MD is mostly designed for local or
at least highly reliable disks (where a disk can be SAN, SCSI, etc.), and
a failure is not normal for MD. Hence the need for manual reconnect in the
MD case and the automated handling of reconnects in the case of DRBD.

I'm just a happy user of both MD over local disks and DRBD for networked
raid.

regards,
iustin


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Paul Clements

Iustin Pop wrote:

On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:

On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:

now, I am not an expert on either option, but there are a couple of things that I
would question about the DRBD+MD option

1. when the remote machine is down, how does MD deal with it for reads and
writes?

I suppose it kicks the drive and you'd have to re-add it by hand unless done by
a cronjob.


Yes, and with a bitmap configured on the raid1, you just resync the 
blocks that have been written while the connection was down.




From my tests, since NBD doesn't have a timeout option, MD hangs in the
write to that mirror indefinitely, somewhat like when dealing with a
broken IDE driver/chipset/disk.


Well, if people would like to see a timeout option, I actually coded up 
a patch a couple of years ago to do just that, but I never got it into 
mainline because you can do almost as well by doing a check at 
user-level (I basically ping the nbd connection periodically and if it 
fails, I kill -9 the nbd-client).
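A minimal sketch of that kind of user-level check (the server name, port,
polling interval and use of pkill below are assumptions for illustration, not
Paul's actual script):

#!/usr/bin/env python
# Illustrative watchdog: probe the NBD server's TCP port and, if it stops
# answering, kill the local nbd-client so MD gets an error on that mirror leg
# instead of hanging forever on the dead connection.
import socket
import subprocess
import time

HOST, PORT = "backup-server", 2000   # hypothetical NBD server and export port
PROBE_TIMEOUT = 10                   # seconds to wait for the TCP connect

def nbd_alive():
    try:
        with socket.create_connection((HOST, PORT), timeout=PROBE_TIMEOUT):
            return True
    except OSError:
        return False

while True:
    if not nbd_alive():
        # SIGKILL the nbd-client process (the kill -9 described above);
        # the nbd device then errors out and MD fails that mirror.
        subprocess.run(["pkill", "-9", "-x", "nbd-client"])
    time.sleep(30)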




2. MD over local drive will alternate reads between mirrors (or so I've been
told), doing so over the network is wrong.

Certainly. In which case you set write_mostly (or even write_only, not sure
of its name) on the raid component that is nbd.


3. when writing, will MD wait for the network I/O to get the data saved on the
backup before returning from the syscall? or can it sync the data out lazily

Can't answer this one - ask Neil :)


MD has the write-mostly/write-behind options - which help in this case
but only up to a certain amount.


You can configure write_behind (aka, asynchronous writes) to buffer as 
much data as you have RAM to hold. At a certain point, presumably, you'd 
want to just break the mirror and take the hit of doing a resync once 
your network leg falls too far behind.
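To tie the bitmap, write-mostly and write-behind pieces together, here is a
sketch of how such a raid1 over a local disk plus an NBD device might be
assembled (the device names and the write-behind limit are arbitrary examples,
not a recommended configuration):

#!/usr/bin/env python
# Illustration only: build and run the mdadm command for a raid1 whose second
# leg is an NBD device, with a write-intent bitmap (a dropped leg later resyncs
# only the blocks written while it was away) and write-behind on that leg.
import subprocess

cmd = [
    "mdadm", "--create", "/dev/md0",
    "--level=1", "--raid-devices=2",
    "--bitmap=internal",            # write-intent bitmap, needed for write-behind
    "--write-behind=1024",          # cap on writes buffered for the write-mostly leg
    "/dev/sda1",                    # local disk: serves reads
    "--write-mostly", "/dev/nbd0",  # network leg: avoid reading from it
]
subprocess.run(cmd, check=True)

Note that write-behind only applies to components flagged write-mostly and
requires the bitmap, which matches the constraints described above.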


--
Paul


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david
per the message below MD (or DM) would need to be modified to work 
reasonably well with one of the disk components being over an unreliable 
link (like a network link)


are the MD/DM maintainers interested in extending their code in this 
direction? or would they prefer to keep it simpler by being able to 
continue to assume that the raid components are connected over a highly 
reliable connection?


if they are interested in adding (and maintaining) this functionality then 
there is a real possibility that NBD+MD/DM could eliminate the need for 
DRBD. However, if they are not interested in adding all the code to deal 
with the network-type issues, then the argument that DRBD should not be 
merged because you can do the same thing with MD/DM + NBD is invalid and 
can be dropped/ignored.


David Lang

On Sun, 12 Aug 2007, Paul Clements wrote:


Iustin Pop wrote:

 On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
  On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
   now, I am not an expert on either option, but there are a couple of
   things that I would question about the DRBD+MD option
  
   1. when the remote machine is down, how does MD deal with it for reads
   and writes?
  I suppose it kicks the drive and you'd have to re-add it by hand unless
  done by a cronjob.


Yes, and with a bitmap configured on the raid1, you just resync the blocks 
that have been written while the connection was down.




From my tests, since NBD doesn't have a timeout option, MD hangs in the
 write to that mirror indefinitely, somewhat like when dealing with a
 broken IDE driver/chipset/disk.


Well, if people would like to see a timeout option, I actually coded up a 
patch a couple of years ago to do just that, but I never got it into mainline 
because you can do almost as well by doing a check at user-level (I basically 
ping the nbd connection periodically and if it fails, I kill -9 the 
nbd-client).



   2. MD over local drive will alternate reads between mirrors (or so
   I've been told), doing so over the network is wrong.
  Certainly. In which case you set write_mostly (or even write_only, not
  sure of its name) on the raid component that is nbd.
 
   3. when writing, will MD wait for the network I/O to get the data
   saved on the backup before returning from the syscall? or can it sync
   the data out lazily?

  Can't answer this one - ask Neil :)

 MD has the write-mostly/write-behind options - which help in this case
 but only up to a certain amount.


You can configure write_behind (aka, asynchronous writes) to buffer as much 
data as you have RAM to hold. At a certain point, presumably, you'd want to 
just break the mirror and take the hit of doing a resync once your network 
leg falls too far behind.


--
Paul

