[ANNOUNCE]: SCST 3.3 pre-release freeze

2017-08-31 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.3 pre-release code freeze in the SCST SVN branch 
3.3.x.

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.3.x

It is going to be released after a few weeks of testing, if no significant issues are found.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. Many recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to Bart Van Assche and 
SanDisk/Western Digital!

Vlad




[ANNOUNCE]: SCST 3.2 released

2016-12-15 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce that SCST 3.2 has just been released.

You can download it from http://scst.sourceforge.net/downloads.html

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. Many modern SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk/WDC for the great 
support!

Vlad




[ANNOUNCE]: SCST 3.2 pre-release freeze

2016-08-02 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.2 pre-release code freeze in the SCST SVN branch 
3.2.x.

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.2.x

It is going to be released after a few weeks of testing, if no significant issues are found.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk/WDC for the great 
support!

Vlad




[ANNOUNCE]: SCST 3.1 release

2016-01-21 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce that SCST version 3.1 has just been released and is available for download from http://scst.sourceforge.net/downloads.html.

Highlights for this release:

 - Cluster support for SCSI reservations. This feature is essential for 
initiator-side
clustering approaches based on persistent reservations, e.g. the quorum disk
implementation in Windows Clustering.

 - Full support for VAAI, or vStorage API for Array Integration: Extended Copy command support has been added, and the performance of WRITE SAME and of Atomic Test & Set, also known as COMPARE AND WRITE, has been improved.

 - T10-PI support has been added.

 - ALUA support has been improved: explicit ALUA (SET TARGET PORT GROUPS 
command) has
been added and DRBD compatibility has been improved.

 - SCST events user space infrastructure has been added, so now SCST can notify 
a user
space agent about important internal and fabric events.

 - QLogic target driver has been significantly improved.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk for the great support! Development of all the above highlights was supported by SanDisk.

Vlad




Re: [ANNOUNCE]: SCST 3.1 pre-release freeze

2015-11-06 Thread Vladislav Bolkhovitin
Hi,

Bike & Snow wrote on 11/06/2015 10:55 AM:
> Hello Vlad
> 
> Excellent news on all the updates.
> 
> Regarding this:
> - QLogic target driver has been significantly improved.
> 
> Does that mean I should stop building the QLogic target driver from here?
> git://git.qlogic.com/scst-qla2xxx.git 
> 
> Or are you saying the git.qlogic.com  has been 
> improved?

It is saying that qla2x00t was improved.

The ultimate goal is for the mainstream (git) QLogic target driver to become the main and only QLogic target driver, but, unfortunately, this driver has not yet reached the level of quality and maturity of qla2x00t. We are working with QLogic toward that.

> If I stop building the one from git.qlogic.com , does 
> the 3.2.0
> one support NPIV?

Yes, it has full NPIV support.

Vlad




[ANNOUNCE]: SCST 3.1 pre-release freeze

2015-11-05 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.1 pre-release code freeze in the SCST SVN branch 3.1.x.

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.1.x

It is going to be released after a few weeks of testing, if no significant issues are found.

Highlights for this release:

 - Cluster support for SCSI reservations. This feature is essential for 
initiator-side
clustering approaches based on persistent reservations, e.g. the quorum disk
implementation in Windows Clustering.

 - Full support for VAAI, or vStorage API for Array Integration: Extended Copy command support has been added, and the performance of WRITE SAME and of Atomic Test & Set, also known as COMPARE AND WRITE, has been improved.

 - T10-PI support has been added.

 - ALUA support has been improved: explicit ALUA (SET TARGET PORT GROUPS 
command) has
been added and DRBD compatibility has been improved.

 - SCST events user space infrastructure has been added, so now SCST can notify 
a user
space agent about important internal and fabric events.

 - QLogic target driver has been significantly improved.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk for the great support! Development of all the above highlights was supported by SanDisk.

Vlad



[ANNOUNCE]: SCST 3.0.1 released

2015-02-24 Thread Vladislav Bolkhovitin
I'm glad to announce that the maintenance update 3.0.1 for SCST and its drivers has just been released and is ready for download from http://scst.sourceforge.net/downloads.html. All SCST users are encouraged to update.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially Bart Van Assche!

Vlad



Re: [ANNOUNCE]: SCST 3.0 released

2014-09-21 Thread Vladislav Bolkhovitin
No, because it's too new, but you can always get it from git. Or you can use the stable Emulex driver for 16Gb connectivity. It's not in the bundle only because of Emulex policy.


Thanks,
Vlad

On 9/19/2014 23:59, scst.n...@gmail.com wrote:

Is 16Gb qla2x00t included?

Sent from my Mi phone

Vladislav Bolkhovitin wrote on 2014-9-20 at 2:39 PM:

Hi All,

I'm glad to announce that SCST 3.0 has just been released. This
release includes SCST
core, target drivers iSCSI-SCST for iSCSI, including iSER support
(thanks to
Mellanox!), qla2x00t for QLogic Fibre Channel adapters, ib_srpt for
InfiniBand SRP,
fcst for FCoE and scst_local for local loopback-like access as well
as SCST management
utility scstadmin. Also separately you can download from Emulex
development portal
stable and fully functional target driver for the current generation
of Emulex Fibre
Channel adapters.

SCST is alternative SCSI target stack for Linux. SCST allows
creation of sophisticated
storage devices, which provide advanced functionality, like
replication, thin
provisioning, deduplication, high availability, automatic backup,
etc. Majority of
recently developed SAN appliances, especially higher end ones, are
SCST based. It might
well be that your favorite storage appliance running SCST in the
firmware.

More info about SCST and its modules you can find on:
http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad




[ANNOUNCE]: SCST 3.0 released

2014-09-20 Thread Vladislav Bolkhovitin

Hi All,

I'm glad to announce that SCST 3.0 has just been released. This release includes the SCST core, the target drivers iSCSI-SCST for iSCSI, including iSER support (thanks to Mellanox!), qla2x00t for QLogic Fibre Channel adapters, ib_srpt for InfiniBand SRP, fcst for FCoE and scst_local for local loopback-like access, as well as the SCST management utility scstadmin. A stable and fully functional target driver for the current generation of Emulex Fibre Channel adapters can also be downloaded separately from the Emulex development portal.
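
For anyone who wants to try the release, a minimal scstadmin configuration
(/etc/scst.conf) that exports one file-backed virtual disk over iSCSI might look
roughly like the sketch below; the device name, backing file path and target IQN
are just placeholders, not anything shipped with SCST:

HANDLER vdisk_fileio {
        DEVICE disk01 {
                filename /var/lib/scst/disk01.img
        }
}

TARGET_DRIVER iscsi {
        enabled 1

        TARGET iqn.2014-09.net.example:tgt1 {
                enabled 1
                LUN 0 disk01
        }
}

Such a file is typically applied with "scstadmin -config /etc/scst.conf" once the
SCST modules are loaded.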


SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad



[ANNOUNCE]: SCST 3.0 pre-release freeze

2014-05-21 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.0 pre-release code freeze in the SCST SVN branch 
3.0.x

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.0.x

It is going to be released after a few weeks of testing, if nothing bad is found.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad



[ANNOUNCE]: SCST iSER target driver is available for testing

2014-01-29 Thread Vladislav Bolkhovitin
I'm glad to announce that the SCST iSER target driver is available for testing from the SCST SVN iser branch. You can download it either with the command:

$ svn checkout svn://svn.code.sf.net/p/scst/svn/branches/iser iser-scst-branch

or by clicking on "Download Snapshot" button on 
http://sourceforge.net/p/scst/svn/HEAD/tree/branches/iser page.

Big thanks to Yan Burman and Mellanox Technologies who developed it!

SCST is a SCSI target mode stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Vlad




Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-28 Thread Vladislav Bolkhovitin

Zuckerman, Boris, on 09/26/2013 12:36 PM wrote:
> I assume that we may have both: CPUs that may have ability to support 
> multiple transactions, CPUs that support only one, CPUs that support none (as 
> today), as well as different devices - transaction capable and not.
> So, it seems there is a room for compilers to do their work and for class 
> drivers to do their, right?

Yes, correct.

Conceptually, NVDIMMs are not block devices. They may be used as block devices, but they may also not be. So, nailing them into the block abstraction with a big hammer is simply bad design.

Vlad

> boris
> 
>> -Original Message-
>> From: Matthew Wilcox [mailto:wi...@linux.intel.com]
>> Sent: Thursday, September 26, 2013 1:56 PM
>> To: Zuckerman, Boris
>> Cc: Vladislav Bolkhovitin; rob.gitt...@linux.intel.com; 
>> linux-p...@lists.infradead.org;
>> linux-fsde...@veger.org; linux-kernel@vger.kernel.org
>> Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs
>>
>> On Thu, Sep 26, 2013 at 02:56:17PM +, Zuckerman, Boris wrote:
>>> To work with persistent memory as efficiently as we can work with RAM we 
>>> need a
>> bit more than "commit". It's reasonable to expect that we get some additional
>> support from CPUs that goes beyond mfence and mflush. That may include 
>> discovery,
>> transactional support, etc. Encapsulating that in a special class sooner 
>> than later
>> seams a right thing to do...
>>
>> If it's something CPU-specific, then we wouldn't handle it as part of the 
>> "class", we'd
>> handle it as an architecture abstraction.  It's only operations which are 
>> device-specific
>> which would need to be exposed through an operations vector.  For example, 
>> suppose
>> you buy one device from IBM and another device from HP, and plug them both 
>> into
>> your SPARC system.  The code you compile needs to run on SPARC, doing 
>> whatever
>> CPU operations are supported, but if HP and IBM have different ways of 
>> handling a
>> "commit" operation, we need that operation to be part of an operations 
>> vector.




Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-26 Thread Vladislav Bolkhovitin
Hi Rob,

Rob Gittins, on 09/23/2013 03:51 PM wrote:
> On Fri, 2013-09-06 at 22:12 -0700, Vladislav Bolkhovitin wrote:
>> Rob Gittins, on 09/04/2013 02:54 PM wrote:
>>> Non-volatile DIMMs have started to become available.  A NVDIMMs is a
>>> DIMM that does not lose data across power interruptions.  Some of the
>>> NVDIMMs act like memory, while others are more like a block device
>>> on the memory bus. Application uses vary from being used to cache
>>> critical data, to being a boot device.
>>>
>>> There are two access classes of NVDIMMs,  block mode and
>>> “load/store” mode DIMMs which are referred to as Direct Memory
>>> Mappable.
>>>
>>> The block mode is where the DIMM provides IO ports for read or write
>>> of data.  These DIMMs reside on the memory bus but do not appear in the
>>> application address space.  Block mode DIMMs do not require any changes
>>> to the current infrastructure, since they provide IO type of interface.
>>>
>>> Direct Memory Mappable DIMMs (DMMD) appear in the system address space
>>> and are accessed via load and store instructions.  These NVDIMMs
>>> are part of the system physical address space (SPA) as memory with
>>> the attribute that data survives a power interruption.  As such this
>>> memory is managed by the kernel which can  assign virtual addresses and
>>> mapped into application’s address space as well as being accessible
>>> by the kernel.  The area mapped into the system address space is
>>> being referred to as persistent memory (PMEM).
>>>
>>> PMEM introduces the need for new operations in the
>>> block_device_operations to support the specific characteristics of
>>> the media.
>>>
>>> First data may not propagate all the way through the memory pipeline
>>> when store instructions are executed.  Data may stay in the CPU cache
>>> or in other buffers in the processor and memory complex.  In order to
>>> ensure the durability of data there needs to be a driver entry point
>>> to force a byte range out to media.  The methods of doing this are
>>> specific to the PMEM technology and need to be handled by the driver
>>> that is supporting the DMMDs.  To provide a way to ensure that data is
>>> durable adding a commit function to the block_device_operations vector.
>>>
>>>void (*commitpmem)(struct block_device *bdev, void *addr);
>>
>> Why to glue to the block concept for apparently not block class of devices? 
>> By pushing
>> NVDIMMs into the block model you both limiting them to block devices 
>> capabilities as
>> well as have to expand block devices by alien to them properties
> Hi Vlad,
> 
> We chose to extent the block operations for a couple of reasons.  The
> majority of NVDIMM usage is by emulating block mode.  We figure that
> over time usages will appear that use them directly and then we can
> design interfaces to enable direct use.  
> 
> Since a range of NVDIMM needs a name, security and other attributes mmap
> is a really good model to build on.  This quickly takes us into the
> realm of a file systems, which are easiest to build on the existing
> block infrastructure.  
> 
> Another reason to extend block is that all of the existing
> administrative interfaces and tools such as mkfs still work and we have
> not added some new management tools and requirements that may inhibit
> the adoption of the technology.  Basically if it works today for block
> the same cli commands will work for NVDIMMs.
> 
> The extensions are so minimal that they don't negatively impact the
> existing interfaces.

Well, they will negatively impact them, because those NVDIMM additions are conceptually alien to the block device concept.

You didn't answer: why not create a new class of devices for NVDIMMs and implement a one-fits-all block driver for them? That is a simple, clean and elegant solution, which will fit your need to have a block device on top of an NVDIMM device pretty well, with minimal effort.
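
Just to sketch what such a class could look like (every name below is hypothetical
and not existing kernel API), the per-device operations of a dedicated NVDIMM class
might be as small as:

#include <linux/types.h>

struct nvdimm_dev;      /* hypothetical per-device object of the new class */

/* Hypothetical ops vector an NVDIMM driver would register with the class. */
struct nvdimm_class_ops {
        /* Size of the persistent region in bytes. */
        u64 (*get_size)(struct nvdimm_dev *dev);

        /* Map [offset, offset + len) of the region into kernel space. */
        void *(*map)(struct nvdimm_dev *dev, u64 offset, size_t len);

        /* Make a previously written range durable on the media. */
        int (*commit)(struct nvdimm_dev *dev, void *addr, size_t len);
};

A single generic block driver could then be written once against this vector to
expose any registered NVDIMM as a block device, while load/store users talk to the
class directly and bypass the block layer.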

Vlad



Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-06 Thread Vladislav Bolkhovitin

Rob Gittins, on 09/04/2013 02:54 PM wrote:
> Non-volatile DIMMs have started to become available.  A NVDIMMs is a
> DIMM that does not lose data across power interruptions.  Some of the
> NVDIMMs act like memory, while others are more like a block device
> on the memory bus. Application uses vary from being used to cache
> critical data, to being a boot device.
> 
> There are two access classes of NVDIMMs,  block mode and
> “load/store” mode DIMMs which are referred to as Direct Memory
> Mappable.
> 
> The block mode is where the DIMM provides IO ports for read or write
> of data.  These DIMMs reside on the memory bus but do not appear in the
> application address space.  Block mode DIMMs do not require any changes
> to the current infrastructure, since they provide IO type of interface.
> 
> Direct Memory Mappable DIMMs (DMMD) appear in the system address space
> and are accessed via load and store instructions.  These NVDIMMs
> are part of the system physical address space (SPA) as memory with
> the attribute that data survives a power interruption.  As such this
> memory is managed by the kernel which can  assign virtual addresses and
> mapped into application’s address space as well as being accessible
> by the kernel.  The area mapped into the system address space is
> being referred to as persistent memory (PMEM).
> 
> PMEM introduces the need for new operations in the
> block_device_operations to support the specific characteristics of
> the media.
> 
> First data may not propagate all the way through the memory pipeline
> when store instructions are executed.  Data may stay in the CPU cache
> or in other buffers in the processor and memory complex.  In order to
> ensure the durability of data there needs to be a driver entry point
> to force a byte range out to media.  The methods of doing this are
> specific to the PMEM technology and need to be handled by the driver
> that is supporting the DMMDs.  To provide a way to ensure that data is
> durable adding a commit function to the block_device_operations vector.
> 
>void (*commitpmem)(struct block_device *bdev, void *addr);

Why glue to the block concept for what is apparently not a block class of devices? By pushing NVDIMMs into the block model you are both limiting them to block device capabilities and having to extend block devices with properties that are alien to them.

NVDIMMs are, apparently, a new class of devices, so it is better to have a new class of kernel devices for them. If you then need to put file systems on top of them, just write a one-fits-all blk_nvmem driver, which can create a block device for all types of NVDIMM devices and drivers.

This way you will clearly and gracefully get the best from NVDIMM devices and won't soil block devices.
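
As a rough illustration of the blk_nvmem idea (the structure and helper below are
hypothetical, and the actual block-layer registration is omitted), the whole data
path of such a shim comes down to a memcpy into the mapped persistent region plus
the device's commit callback:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/errno.h>

/* Hypothetical handle an NVDIMM driver would hand to blk_nvmem. */
struct nvmem_region {
        void *base;     /* persistent region mapped into kernel space */
        u64 size;       /* region size in bytes */
        /* Make a written range durable; supplied by the NVDIMM driver. */
        int (*commit)(struct nvmem_region *r, void *addr, size_t len);
};

/* Serve one block-device read or write against the region. */
static int blk_nvmem_rw(struct nvmem_region *r, void *buf, u64 offset,
                        size_t len, bool is_write)
{
        if (offset + len > r->size)
                return -EINVAL;

        if (is_write) {
                memcpy(r->base + offset, buf, len);
                /* Force the just-written range out to the media. */
                return r->commit(r, r->base + offset, len);
        }

        memcpy(buf, r->base + offset, len);
        return 0;
}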

Vlad



[ANNOUNCE]: Emulex SCST support for 16Gb/s FC and FCoE CNAs

2013-09-04 Thread Vladislav Bolkhovitin
I'm glad to announce that SCST support for 16Gb/s FC and FCoE Emulex CNAs is now
available as part of the Emulex OneCore Storage SDK tool set based on the 
Emulex SLI-4
API. Support for 16Gb/s Fibre Channel LPe16000 series and FCoE hardware using 
target
mode versions of the OneConnect FCoE CNAs is included. Documented for use with
RHEL/CentOS 6.x based distributions, ocs_fc_scst works with the stable SCST 
2.2.1 as
well as the development versions of 2.2.x and 3.0.x. The driver code and 
documentation
are available on the Emulex web site at:
http://www.emulex.com/products/onecore-storage-software-development-kit/overview.html

Registration is required on the Developer Portal, but this is free.

Questions regarding this driver are better asked via the Developer Portal.

SCST is a SCSI target mode stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Vlad



Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))

2013-05-29 Thread Vladislav Bolkhovitin
Martin K. Petersen, on 05/28/2013 01:25 PM wrote:
> Vladislav> Linux block layer is purely artificial creature slowly
> Vladislav> reinventing wheel creating more problems, than solving.
> 
> On the contrary. I do think we solve a whole bunch of problems.
> 
> 
> Vladislav> It enforces approach, where often "impossible" means
> Vladislav> "impossible in this interface".
> 
> I agree we have limitations. I do not agree that all limitations are
> bad. Sometimes it's OK to say no.
> 
> 
> Vladislav> For instance, how about copy offload?  How about atomic
> Vladislav> writes?
> 
> I'm actively working on copy offload. Nobody appears to be interested in
> atomic writes. Otherwise I'd work on those as well.
> 
> 
> Vladislav> Why was it needed to create special blk integrity interface
> Vladislav> with the only end user - SCSI?
> 
> Simple. Because we did not want to interleave data and PI 512+8+512+8
> neither in memory, nor at DMA time.

It can similarly be done in a SCSI-like interface without the need for any middleman.

> Furthermore, the ATA EPP proposal
> was still on the table so I also needed to support ATA.
> 
> And finally, NVM Express uses the blk_integrity interface as well.
> 
> 
> Vladislav> The block layer keeps repeating SCSI. So, maybe, after all,
> Vladislav> it's better to acknowledge that direct usage of SCSI without
> Vladislav> any intermediate layers and translations is more productive?
> Vladislav> And for those minors not using SCSI internally, translate
> Vladislav> from SCSI to their internal commands? Creating and filling
> Vladislav> CDB fields for most cases isn't anyhow harder, than creating
> Vladislav> and feeling bio fields.
> 
> This is quite possibly the worst idea I have heard all week.
> 
> As it stands it's a headache for the disk ULD driver to figure out which
> of the bazillion READ/WRITE variants to send to a SCSI/ATA device. What
> makes you think that an application or filesystem would be better
> equipped to make that call?
> 
> See also: WRITE SAME w/ zeroes vs. WRITE SAME w/ UNMAP vs. UNMAP 
> 
> See also: EXTENDED COPY vs. the PROXY command set
> 
> See also: USB-ATA bridge chips
> 
> You make it sound like all the block layer does is filling out
> CDBs. Which it doesn't in fact have anything to do with at all.
> 
> When you are talking about CDBs we're down in the SBC/SSC territory.
> Which is such a tiny bit of what's going on. We have transports, we have
> SAM, we have HBA controller DMA constraints, system DMA constraints,
> buffer bouncing, etc. There's a ton of stuff that needs to happen before
> the CDB and the data physically reach the storage.
> 
> You seem to be advocating that everything up to the point where the
> device receives the command is in the way. Well, by all means. Why limit
> ourselves to the confines of SCSI? Why not get rid of POSIX
> read()/write(), page cache, filesystems and let applications speak
> ST-506 directly?
> 
> I know we're doing different things. My job is to make a general purpose
> operating system with interfaces that make sense to normal applications.
> That does not preclude special cases where it may make sense to poke at
> the device directly. For testing purposes, for instance. But I consider
> it a failure when we start having applications that know about hardware
> intricacies, cylinders/heads/sectors, etc. That road leads straight to
> the 1980s...

What you mean is true, but my point is that this abstraction is better done in a SCSI, i.e. SAM, manner. No need to write fields inside of CDBs, that would be pretty inconvenient ;). But CDB fields can be fields in some scsi_io structure. Exact opcodes can easily be abstracted and filled in at the last stage, where the final CDB is constructed from those fields.
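
To make that concrete, such a scsi_io structure could look roughly like the sketch
below (the field and type names are made up for illustration only); only the last,
device-facing stage would turn it into a concrete READ(10)/READ(16)/WRITE SAME/etc.
CDB:

#include <linux/types.h>

struct scatterlist;

/* Generic, SAM-level operation; the exact opcode is chosen later. */
enum scsi_io_op {
        SIO_READ,
        SIO_WRITE,
        SIO_WRITE_SAME,
        SIO_UNMAP,
        SIO_EXTENDED_COPY,
};

/* Hypothetical SAM-level I/O descriptor filled in by the submitter. */
struct scsi_io {
        enum scsi_io_op op;     /* what to do */
        u64 lba;                /* starting logical block address */
        u32 nr_blocks;          /* transfer length in blocks */
        unsigned fua:1;         /* force unit access */
        unsigned prot:1;        /* T10-PI protection requested */
        struct scatterlist *sg; /* data buffer */
        void (*done)(struct scsi_io *io, int result);   /* completion */
};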

The problem with the block abstraction is that it is the least common denominator of all block device capabilities, hence advanced capabilities, available only to some class of devices, automatically become "impossible". Hence, it would be more productive to instead use the most capable abstraction, which is SAM. In this abstraction there is no need to reinvent complex interfaces and write complex middleman code for every advanced capability. All advanced capabilities are available there by definition, if supported by the underlying hardware. That's my point.

POSIX is for simple applications, for which read()/write() calls are sufficient. They are outside of our discussion. But advanced applications need more. I know plenty of applications issuing direct SCSI commands, but how many applications can you name that use the block interface (bsg)? I can recall only one relatively widely used Linux-specific library. That's all. This interface is not demanded by applications.

Vlad


Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))

2013-05-24 Thread Vladislav Bolkhovitin
Martin K. Petersen, on 05/22/2013 09:32 AM wrote:
> Paolo> First of all, I'll note that SG_IO and block-device-specific
> Paolo> ioctls both have their place.  My usecase for SG_IO is
> Paolo> virtualization, where I need to pass information from the LUN to
> Paolo> the virtual machine with as much fidelity as possible if I choose
> Paolo> to virtualize at the SCSI level.  
> 
> Now there's your problem! Several people told you way back that the SCSI
> virt approach was a really poor choice. The SG_IO permissions problem is
> a classic "Doctor, it hurts when I do this".
> 
> The kernel's fundamental task is to provide abstraction between
> applications and intricacies of hardware. The right way to solve the
> problem would have been to provide a better device abstraction built on
> top of the block/SCSI infrastructure we already have in place. If you
> need more fidelity, add fidelity to the block layer instead of punching
> a giant hole through it.
> 
> I seem to recall that reservations were part of your motivation for
> going the SCSI route in the first place. A better approach would have
> been to create a generic reservations mechanism that could be exposed to
> the guest. And then let the baremetal kernel worry about the appropriate
> way to communicate with the physical hardware. Just like we've done with
> reads and writes, discard, write same, etc.

Well, any abstraction is good only if it isn't artificial, i.e. if it solves more
problems than it creates.

Reality is that, de facto, _SCSI_ is the industry's abstraction for block/direct
access to data. Look around: how many of the systems around you, after all the
layers, end up issuing SCSI commands to their storage devices?

The Linux block layer is a purely artificial creature, slowly reinventing the wheel
and creating more problems than it solves. It enforces an approach where "impossible"
often means "impossible in this interface". For instance, how about copy offload? How
about reservations? How about atomic writes? Look at the history of barriers and
compare it with what can be done in SCSI. It's still worse, because it doesn't allow
using all of the devices' capabilities. Why was it necessary to create a special blk
integrity interface with the only end user being SCSI? An artificial task was
created, then well solved. Etc, etc.

The block layer keeps repeating SCSI. So maybe, after all, it's better to acknowledge
that direct usage of SCSI, without any intermediate layers and translations, is more
productive? And for the minority of devices not using SCSI internally, translate from
SCSI to their internal commands? Creating and filling CDB fields is in most cases no
harder than creating and filling bio fields.

So I appreciate the work Paolo is doing in this direction. At least the right thing
will be done at the virtualization level.

I do understand that, with all the existing baggage, replacing the block layer with
SCSI isn't practical, and I am not proposing it; but let's at least acknowledge the
limitations of the academic block abstraction. Let's not turn those limitations into
global walls. Many things are better done using direct SCSI, so let's do them the
better way.

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


WARNING: at lib/dma-debug.c:937 check_unmap+0x45f/0x8b0() (ioat_dma_self_test [ioatdma])

2013-05-13 Thread Vladislav Bolkhovitin
Hello,

I keep getting on each reboot of my kernel 3.9.1 debug system:

[   42.037225] [ cut here ]
[   42.037237] WARNING: at lib/dma-debug.c:937 check_unmap+0x45f/0x8b0()
[   42.037240] Hardware name: PowerEdge R710
[   42.037243] ioatdma :00:16.0: DMA-API: device driver failed to check map 
error[device address=0x0001268fc0f8] [size=2000 bytes] [mapped as single]
[   42.037245] Modules linked in: lpc_ich(+) ehci_pci(+) mfd_core ehci_hcd 
ioatdma(+) dca i7core_edac acpi_power_meter hwmon evdev processor bnx2 
firmware_class loop sg ata_generic pata_acpi usbhid hid raid10 raid456 
async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 
multipath linear crc32c_intel ata_piix libata uhci_hcd mptsas mptscsih mptbase
[   42.037287] Pid: 4457, comm: modprobe Not tainted 3.9.1-scst-dbg #4
[   42.037289] Call Trace:
[   42.037297]  [8103e88f] warn_slowpath_common+0x7f/0xc0
[   42.037302]  [8103e986] warn_slowpath_fmt+0x46/0x50
[   42.037306]  [812bea9f] check_unmap+0x45f/0x8b0
[   42.037312]  [81095ef6] ? __lock_release+0x106/0x170
[   42.037316]  [812bf15a] debug_dma_unmap_page+0x5a/0x60
[   42.037326]  [a01b12c7] ioat_dma_self_test+0x387/0x600 [ioatdma]
[   42.037332]  [81356a8f] ? devres_add+0x4f/0x70
[   42.037338]  [814cc5d2] ? wait_for_completion_timeout+0xf2/0x120
[   42.037346]  [a01b70b6] ioat3_dma_self_test+0x16/0x30 [ioatdma]
[   42.037352]  [a01afbc9] ioat_probe+0xe9/0x100 [ioatdma]
[   42.037359]  [a01b3eb6] ioat3_dma_probe+0x176/0x2d0 [ioatdma]
[   42.037365]  [a01af231] ioat_pci_probe+0x1b1/0x1d0 [ioatdma]
[   42.037370]  [812ccf03] local_pci_probe+0x23/0x40
[   42.037374]  [812cd329] __pci_device_probe+0xd9/0xe0
[   42.037377]  [812cd512] ? pci_dev_get+0x22/0x30
[   42.037381]  [812cd55a] pci_device_probe+0x3a/0x60
[   42.037385]  [813536ac] really_probe+0x6c/0x320
[   42.037389]  [8135399b] driver_probe_device+0x3b/0x80
[   42.037392]  [81353a7b] __driver_attach+0x9b/0xa0
[   42.037396]  [813539e0] ? driver_probe_device+0x80/0x80
[   42.037400]  [813539e0] ? driver_probe_device+0x80/0x80
[   42.037403]  [81351808] bus_for_each_dev+0x98/0xc0
[   42.037407]  [8135338e] driver_attach+0x1e/0x20
[   42.037411]  [81352d88] bus_add_driver+0x208/0x290
[   42.037415]  [a01c] ? 0xa01b
[   42.037418]  [81353de8] driver_register+0x78/0x160
[   42.037422]  [a01c] ? 0xa01b
[   42.037426]  [812cd664] __pci_register_driver+0x64/0x70
[   42.037432]  [a01c006d] ioat_init_module+0x6d/0x1000 [ioatdma]
[   42.037438]  [81000212] do_one_initcall+0x42/0x170
[   42.037443]  [810a425a] do_init_module+0xaa/0x220
[   42.037447]  [810a567e] load_module+0x4ae/0x590
[   42.037452]  [812bce00] ? ddebug_dyndbg_boot_param_cb+0x60/0x60
[   42.037457]  [8129bba0] ? copy_user_generic_string+0x30/0x40
[   42.037461]  [810a1570] ? module_sect_show+0x30/0x30
[   42.037464]  [810a58c4] sys_init_module+0x94/0xc0
[   42.037470]  [814d5f42] system_call_fastpath+0x16/0x1b
[   42.037473] ---[ end trace 4d301874fd4a843a ]---
[   42.037475] Mapped at:
[   42.037477]  [812bf4ad] debug_dma_map_page+0xbd/0x160
[   42.037480]  [a01b10d5] ioat_dma_self_test+0x195/0x600 [ioatdma]
[   42.037487]  [a01b70b6] ioat3_dma_self_test+0x16/0x30 [ioatdma]
[   42.037493]  [a01afbc9] ioat_probe+0xe9/0x100 [ioatdma]
[   42.037499]  [a01b3eb6] ioat3_dma_probe+0x176/0x2d0 [ioatdma]

I first saw this on 3.8. I see a patch to fix it was sent some time ago for
3.9-rcX, but apparently it wasn't applied to the final 3.9.
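
For reference, the warning text ("device driver failed to check map error", mapped at
ioat_dma_self_test) means the self-test maps a DMA buffer but never checks the result
with dma_mapping_error(). The pattern the DMA-API debug code expects looks roughly
like this (an illustrative kernel-style sketch, not the actual ioatdma patch):

  #include <linux/dma-mapping.h>
  #include <linux/errno.h>

  static int example_map_and_use(struct device *dev, void *buf, size_t len)
  {
          dma_addr_t addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

          if (dma_mapping_error(dev, addr))     /* the missing check */
                  return -ENOMEM;

          /* ... start the DMA and wait for it to finish, then: */
          dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);
          return 0;
  }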

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LIO - the broken iSCSI target implementation

2013-01-17 Thread Vladislav Bolkhovitin


Andreas Steinmetz, on 01/16/2013 08:19 PM wrote:

Thus, lio (http://www.linux-iscsi.org/) seemed to be the politically and
technically favoured solution.


[...]


The fun part of it was that I finally ended up using SCST - which was
refrained from kernel inclusion for technical reasons beyond my
knowledge.


No, it was purely political. There has never been any technical argument for why LIO
is better. And there cannot be one.


Thanks,
Vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ANNOUNCE]: SCST version 2.2.1 released

2013-01-15 Thread Vladislav Bolkhovitin
SCST version 2.2.1 has just been released. This release includes SCST core, target 
drivers iSCSI-SCST (iSCSI), qla2x00t (QLogic Fibre Channel), ib_srpt (InfiniBand 
SRP) and scst_local (local loopback-like access) as well as SCST management 
utility scstadmin.


SCST allows creation of sophisticated storage devices, which provide advanced
functionality, like replication, thin provisioning, deduplication, high
availability, automatic backup, etc. Another class of such devices are Virtual
Tape Libraries (VTL) as well as other disk-based backup solutions. Devices created
with SCST are not limited to network interfaces only; they can use any link
supporting SCSI-style data exchange, like Fibre Channel or SAS. The majority of
modern SAN appliances, especially higher-end ones, are SCST based. It might well be
that your favorite storage appliance is running SCST in its firmware.


More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially Bart Van Assche, who prepared the 
packages!


Vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-28 Thread Vladislav Bolkhovitin


Nico Williams, on 11/26/2012 03:05 PM wrote:

Vlad,

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this. Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.


Well, your understanding of memory barriers is wrong, and you are illustrating that
the memory barriers concept is not so well understood in practice.


Simplifying, memory barrier instructions are not a "cache flush" of this CPU, as is
often thought. They set the order in which reads or writes from other CPUs become
visible on this CPU, and nothing else. Locally, on each CPU, reads and writes are
always seen in order. So, (1) on a single-CPU system memory barrier instructions
don't make any sense, and (2) they should go at least in a pair for each CPU
participating in the interaction; otherwise it's an apparent sign of a mistake.
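
To make the pairing point concrete, here is a minimal, self-contained illustration
using C11 atomics (not the kernel's smp_*mb() macros): the producer's release fence
only means something because the consumer has a matching acquire fence; either one
alone guarantees nothing.

  /* Build with: cc -std=c11 -pthread pairing.c */
  #include <stdatomic.h>
  #include <pthread.h>
  #include <stdio.h>

  static int data;
  static atomic_int ready;

  static void *producer(void *arg)
  {
          data = 42;                                     /* plain store      */
          atomic_thread_fence(memory_order_release);     /* "write barrier"  */
          atomic_store_explicit(&ready, 1, memory_order_relaxed);
          return NULL;
  }

  static void *consumer(void *arg)
  {
          while (!atomic_load_explicit(&ready, memory_order_relaxed))
                  ;                                      /* spin on the flag */
          atomic_thread_fence(memory_order_acquire);     /* "read barrier"   */
          printf("%d\n", data);                          /* guaranteed 42    */
          return NULL;
  }

  int main(void)
  {
          pthread_t p, c;
          pthread_create(&p, NULL, producer, NULL);
          pthread_create(&c, NULL, consumer, NULL);
          pthread_join(p, NULL);
          pthread_join(c, NULL);
          return 0;
  }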


There's nothing similar in storage, because storage has strong consistency 
requirements even if it is distributed. All those clouds and hadoops with weak 
consistency requirements are outside of this discussion, although even they don't 
have anything similar to memory barriers.


As I already wrote, the concept of a flat Earth with the Sun revolving around it is
also very simple to understand. Are you still using that concept?



So just give us a barrier.


As with the flat Earth, I'd strongly suggest you start using an adequate concept of
what you want to achieve, starting from what I proposed a few e-mails ago in this
thread.


If you look at it, it offers exactly what you want, only named correctly.

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-19 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and requests ordering abstractions to the
user space:

For async IO (io_submit() and friends) I would extend struct iocb by flags, 
which
would allow to set the required capabilities, i.e. if this request is FUA, or 
full
cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per
each iocb.

For the regular read()/write() I would add to "flags" parameter of
sync_file_range() one more flag: if this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would make
the latest submitted write in this fd ORDERED.


Correction. To avoid possible races, it would be better for the new fcntl() command
to specify that the N subsequent read()/write()/sync() calls are ORDERED.


For instance, in the simplest case of N=1, the single write() issued right after the
fcntl() would be handled as ORDERED.


(Unfortunately, it doesn't look like this old read()/write() interface has room for a
more elegant solution.)


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-16 Thread Vladislav Bolkhovitin


Chris Friesen, on 11/15/2012 05:35 PM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and request ordering abstractions to the user
space:


For async IO (io_submit() and friends) I would extend struct iocb with flags which
allow setting the required capabilities, i.e. whether this request is FUA, or a full
cache sync, immediate [1] or not, ORDERED or not, or any combination of those, per
each iocb.


For the regular read()/write() I would add one more flag to the "flags" parameter of
sync_file_range(): whether this sync is immediate or not.


To enforce ordering rules I would add one more command to fcntl(). It would make the
latest submitted write on this fd ORDERED.


All together those should provide the requested functionality in a simple, effective,
unambiguous and backward-compatible manner.
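
As a sketch of what that could look like from the application side (assuming libaio;
the IOCB_FLAG_* values below are made up for this example and do not exist in the
current kernel):

  #include <libaio.h>
  #include <stddef.h>

  /* Hypothetical per-request attribute flags illustrating the proposal. */
  #define IOCB_FLAG_FUA        (1u << 8)
  #define IOCB_FLAG_SYNC_IMM   (1u << 9)    /* immediate cache sync */
  #define IOCB_FLAG_ORDERED    (1u << 10)

  int submit_ordered_fua_write(io_context_t ctx, int fd,
                               void *buf, size_t len, long long off)
  {
          struct iocb cb;
          struct iocb *cbs[1] = { &cb };

          io_prep_pwrite(&cb, fd, buf, len, off);
          /* Per-request attributes instead of a coarse barrier: this write
           * is ORDERED and must hit non-volatile media (FUA). */
          cb.u.c.flags |= IOCB_FLAG_FUA | IOCB_FLAG_ORDERED;

          return io_submit(ctx, 1, cbs);    /* returns 1 on success */
  }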


Vlad

1. See my other today's e-mail about what is immediate cache sync.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-16 Thread Vladislav Bolkhovitin

David Lang, on 11/15/2012 07:07 AM wrote:

There's no such thing as "barrier". It is fully artificial abstraction. After
all, at the bottom of your stack, you will have to translate it either to cache
flush, or commands order enforcement, or both.


When people talk about barriers, they are talking about order enforcement.


Not correct. When people talk about barriers, they mean different things. For
instance, Alan Cox a few e-mails ago meant cache flush.


That's the problem with the barriers concept: barriers are ambiguous. There's no
single barrier which can fit all requirements.



the hardware capabilities are not directly accessable from userspace (and they
probably shouldn't be)


The discussion is not about directly exposing storage hardware capabilities to user
space. The discussion is about replacing the fully inadequate barrier abstraction
with a set of other, adequate abstractions.


For instance:

1. Cache flush primitives:

1.1. FUA

1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile 
media


1.3. Immediate cache flush, i.e. return ASAP after the cache sync started, 
possibly before all data hit non-volatile media.


2. ORDERED attribute for requests. It provides the following behavior rules:

A. All requests without this attribute can be executed in parallel and be freely
reordered.


B. No ORDERED command can be completed before any previous not-ORDERED or ORDERED
command has completed.


Those abstractions can naturally fit all storage capabilities. For instance:

 - On simple WT cache hardware not supporting ordering commands, (1) translates to a
NOP and (2) to queue draining.


 - On full-featured HW, both (1) and (2) translate to the appropriate storage
capabilities.


On FTL storage, (B) can be further optimized by doing the data transfers for ORDERED
commands in parallel but committing them in the requested order.
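
Purely as an illustration of that translation (all names made up for this sketch),
the submission path could map the per-request attributes to whatever the device
actually supports:

  #include <stdbool.h>

  enum cache_sync { SYNC_NONE, SYNC_NON_IMMEDIATE, SYNC_IMMEDIATE, SYNC_FUA };

  struct dev_caps {
          bool write_back_cache;     /* false => write-through cache        */
          bool ordered_commands;     /* device honors the ORDERED attribute */
  };

  struct req_attrs {
          enum cache_sync sync;
          bool ordered;
  };

  enum submit_action { SUBMIT_PLAIN, SUBMIT_WITH_FUA, SUBMIT_THEN_FLUSH,
                       SUBMIT_ORDERED_TAG, SUBMIT_AFTER_QUEUE_DRAIN };

  enum submit_action translate(const struct req_attrs *r, const struct dev_caps *c)
  {
          if (r->ordered)
                  /* Use the device's ordering capability when present, else
                   * fall back to draining the queue before this request. */
                  return c->ordered_commands ? SUBMIT_ORDERED_TAG
                                             : SUBMIT_AFTER_QUEUE_DRAIN;

          if (r->sync == SYNC_NONE || !c->write_back_cache)
                  return SUBMIT_PLAIN;   /* on WT cache, a cache sync is a NOP */

          return r->sync == SYNC_FUA ? SUBMIT_WITH_FUA : SUBMIT_THEN_FLUSH;
  }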



barriers keep getting mentioned because they are a easy concept to understand.


Well, the concept of a flat Earth with the Sun rotating around it is also easy to
understand. So why isn't it used?


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-16 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:

1. fsync actually does two things at the same time: ordering writes (in a
barrier-like manner), and forcing cached writes to disk. This makes it very
difficult to implement fsync efficiently.


Exactly!


However, logically they are two distinctive functionalities


Exactly!

Those two points are exactly why the concept of barriers must be forgotten for the
sake of productivity and replaced by finer-grained abstractions, and also why they
were removed from the Linux kernel.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-14 Thread Vladislav Bolkhovitin


Nico Williams, on 11/13/2012 02:13 PM wrote:

declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.


Which barriers? Barriers meaning cache flush, or barriers meaning command order, or
barriers meaning both?


There's no such thing as a "barrier". It is a fully artificial abstraction. After
all, at the bottom of your stack, you will have to translate it either to a cache
flush, or to command order enforcement, or to both.


Are you going to invent 3 types of barriers?


There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.


Your mistake is that you are treating barriers as something real, which can do
something real for you, while it is just an artificial abstraction, apparently
invented by people with limited knowledge of how storage works, and hence with a
very foggy vision of how barriers are supposed to be processed by it. A simple,
wrong answer.


Generally, you can invent any abstraction convenient for you, but the farther your
abstractions are from the reality of your hardware, the less you will get from it,
and with bigger effort.


There are no barriers in Linux, and there are not going to be. Accept it. And start
thinking instead about the offload capabilities your storage can offer you.


Vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-14 Thread Vladislav Bolkhovitin


Alan Cox, on 11/13/2012 12:40 PM wrote:

Barriers are pretty much universal as you need them for power off !


I'm afraid, no storage (drives, if you like this term more) at the moment 
supports
barriers and, as far as I know the storage history, has never supported.


The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.


The cache flush is a cache flush. You can call it a barrier if you want to continue
confusing yourself and others.



Instead, what storage does support in this area are:


Yes - the devil is in the detail once you go beyond simple capabilities.


None of those details brings anything unsolvable. For instance, I already described
earlier in this thread a simple way in which the requested order of commands can be
carried through the stack, and I implemented that algorithm in SCST.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:

 SATA's Native Command

Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.



And so? If SATA can't do it, does it mean that nobody else can't do it
too? I know a plenty of non-SATA devices, which can do the ordering
requirements you need.



I would be very much interested in what kind of device support this kind of
"topological order", and in what settings they are typically used.

Does modern flash/SSD (esp. which are used on smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.


I don't think the storage in smartphones can support such advanced functionality,
because it tends to be the cheapest, hence the simplest.


But many modern enterprise SAS drives can do it, because for those customers
performance is the key requirement. Unfortunately, I'm not sure I can name exact
brands and models, because my knowledge comes from NDA'ed docs, so this info may
also be NDA'ed.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

Richard Hipp, on 11/02/2012 08:24 AM wrote:

SQLite cares.  SQLite is an in-process, transaction, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to be have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.


I would suggest you forget the word "barrier" for productivity's sake. You don't
want barriers and the confusion they bring. What you want instead is access to
storage-accelerated cache sync, command ordering and atomic attributes/operations.
See my other e-mail from today about those.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin


Alan Cox, on 11/02/2012 08:33 AM wrote:

b) most drives will internally re-order requests anyway


They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.


c) cheap drives won't support barriers


Barriers are pretty much universal as you need them for power off !


I'm afraid no storage (drives, if you like this term more) at the moment supports
barriers and, as far as I know the storage history, none ever has.


Instead, what storage does support in this area is:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Command ordering facilities: command attributes (ORDERED, SIMPLE, etc.), ACA,
etc.


3. Atomic commands, e.g. scattered writes, which allow writing data to several
separate, non-adjacent blocks in an atomic manner, i.e. guarantee that either all
blocks are written or none at all. This is relatively new functionality, natural
for flash storage with its COW internals.


Obviously, using such atomic write commands, an application or a file system doesn't
need any journaling anymore. FusionIO reported that after they modified MySQL to use
them, they saw a 50% performance increase.


Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently,
including on the same request. That is the root cause of why the barrier concept is
so evil. If you specify a barrier, how can you say what actual action you really
want from the storage: a cache flush? An ordered write? Or both?


This is why the relatively recent removal of barriers from the Linux kernel
(http://lwn.net/Articles/400541/) was a big step ahead. The next logical step should
be to allow the ORDERED attribute for requests to be accelerated by the ORDERED
commands of the storage, if it supports them; if not, fall back to the existing
queue draining.


Actually, I'm wondering why the barrier concept is so sticky in the Linux world. A
simple Google search shows that only Linux uses this concept for storage. And 2
years have passed since they were removed from the kernel, but people still discuss
barriers as if they were still here.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin


Howard Chu, on 11/01/2012 08:38 PM wrote:

Alan Cox wrote:

How about that recently preliminary infrastructure to send ORDERED commands
instead of queue draining was deleted from the kernel, because "there's no
difference where to drain the queue, on the kernel or the storage side"?


Send patches.


Isn't any type of kernel-side ordering an exercise in futility, since
a) the kernel has no knowledge of the disk's actual geometry
b) most drives will internally re-order requests anyway
c) cheap drives won't support barriers


This is why it is so important for performance to use all the storage capabilities.
Particularly, ORDERED commands instead of queue draining, which tries to pretend to
be smarter than the storage.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

Richard Hipp, on 11/02/2012 08:24 AM wrote:

SQLite cares.  SQLite is an in-process, transaction, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to be have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.


I would suggest you forget the word "barrier" for productivity's sake. You don't 
want barriers and the confusion they bring. What you want instead is access to 
storage-accelerated cache sync, command ordering and atomic attributes/operations. 
See my other e-mail from today about those.
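
For readers following the thread, the fsync()-as-barrier pattern Richard describes 
boils down to something like this minimal sketch (plain POSIX calls, not SQLite's 
actual code; the file names are invented):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int jfd = open("journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int dfd = open("database", O_WRONLY | O_CREAT, 0644);
    const char rec[]  = "journal record for page 42";
    const char page[] = "new contents of page 42";

    if (jfd < 0 || dfd < 0) { perror("open"); return 1; }

    write(jfd, rec, sizeof(rec));
    fsync(jfd);              /* the "barrier": the record must be durable ... */

    pwrite(dfd, page, sizeof(page), (off_t)42 * 4096);
    fsync(dfd);              /* ... before the protected page is overwritten */

    close(jfd);
    close(dfd);
    return 0;
}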


Vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:

 SATA's Native Command

Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.



And so? If SATA can't do it, does it mean that nobody else can't do it
too? I know a plenty of non-SATA devices, which can do the ordering
requirements you need.



I would be very much interested in what kind of device support this kind of
topological order, and in what settings they are typically used.

Does modern flash/SSD (esp. which are used on smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.


I don't think the storage in smartphones can support such advanced functionality, 
because it tends to be the cheapest and hence the simplest.


But many modern enterprise SAS drives can do it, because for those customers 
performance is the key requirement. Unfortunately, I'm not sure I can name exact 
brands and models, because my knowledge comes from NDA'ed docs, so this information 
may also be NDA'ed.


Vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-01 Thread Vladislav Bolkhovitin


Alan Cox, on 11/01/2012 05:24 PM wrote:

How about that recently preliminary infrastructure to send ORDERED commands
instead of queue draining was deleted from the kernel, because "there's no
difference where to drain the queue, on the kernel or the storage side"?


Send patches.


OK, then we have good progress!

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-01 Thread Vladislav Bolkhovitin


Alan Cox, on 10/31/2012 05:54 AM wrote:

I don't want to flame on this topic, but you are not right here. As far as I can
see, a big chunk of Linux storage and file system developers are/were employed 
by
the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.

You know, RedHat from recent times also stepped to this market, at least I saw
their advertisement on SDC 2012. So, you can add here all RedHat employees.


Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, some high end desktops are using
SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point


That doesn't contradict the point that high-performance storage vendors are also 
funding Linux kernel storage development.



Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.


How about that recently preliminary infrastructure to send ORDERED commands 
instead of queue draining was deleted from the kernel, because "there's no 
difference where to drain the queue, on the kernel or the storage side"?


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-30 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/27/2012 12:44 AM wrote:

On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:

What different in our positions is that you are considering storage
as something you can connect to your desktop, while in my view
storage is something, which stores data and serves them the best
possible way with the best performance.


I don't get paid to make Linux storage work well for gold-plated
storage, and as far as I know, none of the purveyors of said gold
plated software systems are currently employing Linux file system
developers to make Linux file systems work well on said gold-plated
hardware.


I don't want to start a flame war on this topic, but you are not right here. As far 
as I can see, a big chunk of Linux storage and file system developers are/were 
employed by the "gold-plated storage" manufacturers, starting with FusionIO, SGI 
and Oracle.


You know, RedHat has recently also stepped into this market; at least I saw their 
advertisement at SDC 2012. So you can add all RedHat employees here as well.



As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware.  Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features.  Perhaps now you'll understand better it's not happening?


Price doesn't matter here, because it's completely different topic.


It matters if you think I'm going to do it on my own time, out of my
own budget.  And if you think my employer is going to choose to use
said hardware, price definitely matters.  I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.

It's rare that you get to design something where performance matters
above all else.  Maybe it's that way if you're paid by folks whose job
it is to destablize the world's financial markets by pushing the holes
into the right half plane (i.e., high frequency trading :-).  But for
the rest of the world, price absolutely matters.


I fully understand your position. But "affordable" and "useful" are completely 
orthogonal things. The "high end" features are very useful if you want to get high 
performance. Those who can afford them will use them, which might be your favorite 
bank, for instance, so those features will indirectly be working for you.


Of course, you don't have to work on those features, especially for free, but you 
similarly shouldn't call them useless only because they are not affordable enough 
to be put in a desktop [1].


Our discussion started not from "value-for-money", but from a constant demand to 
perform ordered commands without full queue draining, which has been ignored by the 
Linux storage developers for YEARS as not useful, right?


Vlad

[1] If you or somebody else wants to put something in a desktop that supports all 
the features necessary to perform ORDERED commands, including ACA, you can look at 
modern SAS SSDs. I can't call the price of those devices "high-end".



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-26 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 09:50 AM wrote:

Yeah  I don't buy that.  One, flash is still too expensive.  Two,
the capital costs to build enough Silicon foundries to replace the
current production volume of HDD's is way too expensive for any
company to afford (the cloud providers are buying *huge* numbers of
HDD's) --- and that's assuming companies wouldn't chose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than short-term cache, and not as a long term stable store.

If end users completely give up on flash, and store all of their
precious family pictures on flash storage, after a couple of years,
they are likely going to be very disappointed

Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.


Here I agree with you.

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-26 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 01:14 AM wrote:

On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:

Yes, SCSI has full support for ordered/simple commands designed
exactly for that task: to have steady flow of commands even in case
when some of them are ordered.


SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ).  Not all devices do.

More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ.


What is different in our positions is that you are considering storage as something 
you can connect to your desktop, while in my view storage is something which stores 
data and serves it in the best possible way with the best performance.


Hence, for you the least common denominator of all storage features is what matters 
most, while for me getting the best of what is possible from the storage matters 
most.


In my view storage should offload as much as possible from the host system: data 
movement, ordered-operation requirements, atomic operations, deduplication, 
snapshots, reliability measures (e.g. RAID), load balancing, etc.


It's the same as with 2D/3D video acceleration hardware. If you want the best 
performance from your system, you should offload as much as possible from it: in 
the case of video, to the video hardware; in the case of storage, to the storage. 
As with video, better offload means better performance for storage. At hundreds of 
thousands of IOPS it is clearly visible.


Price doesn't matter here, because it's a completely different topic.


SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.


And so? If SATA can't do it, does that mean nobody else can do it either? I know 
plenty of non-SATA devices which can satisfy the ordering requirements you need.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-26 Thread Vladislav Bolkhovitin


Nico Williams, on 10/24/2012 05:17 PM wrote:

Yes, SCSI has full support for ordered/simple commands designed exactly for
that task: [...]

[...]

But historically for some reason Linux storage developers were stuck with
"barriers" concept, which is obviously not the same as ORDERED commands,
hence had a lot troubles with their ambiguous semantic. As far as I can tell
the reason of that was some lack of sufficiently deep SCSI understanding
(how to handle errors, believe that ACA is something legacy from parallel
SCSI times, etc.).


Barriers are a very simple abstraction, so there's that.


It isn't simple at all. If you think for some time about barriers from the storage 
point of view, you will soon realize how bad and ambiguous they are.



Before that happens, people will keep returning again and again with those
simple questions: why the queue must be flushed for any ordered operation?
Isn't is an obvious overkill?


That [cache flushing]


It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if 
you like.


Often there's a big difference depending on where it's done: on the system side or 
on the storage side.


Actually, in many cases the performance improvements from NCQ come not from 
allowing the drive to reorder requests, as is commonly thought, but from allowing 
the drive's internal processing stages to stay busy without any idle time. Drives 
often have a long internal pipeline, hence the need to keep every stage of it 
always busy, and hence why using ORDERED commands is important for performance.



is not what's being asked for here. Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known ubberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, d) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.


I believe what you really want is to be able to send to the storage a sequence of 
your favorite operations (FS operations, async IO operations, etc.) like:


Write back caching disabled:

data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write back caching enabled:

data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, 
..., data op2M, ...


Right?

(ORDERED means it is guaranteed that this ordered command will never, under any 
circumstances, be executed before all previous commands have completed, and that no 
subsequent command will be executed before it completes.)
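
Spelled out as a tagged command stream, the write-back-caching case above would 
look roughly like the toy sketch below (the names are mine for illustration only, 
not a real SCSI or block layer API):

#include <stdio.h>

enum task_attr { ATTR_SIMPLE, ATTR_ORDERED };

struct cmd { const char *what; enum task_attr attr; int fua; };

static void queue_cmd(const struct cmd *c)
{
    printf("%-20s attr=%-7s FUA=%d\n", c->what,
           c->attr == ATTR_ORDERED ? "ORDERED" : "SIMPLE", c->fua);
}

int main(void)
{
    const struct cmd stream[] = {
        { "data op 1.1",       ATTR_SIMPLE,  0 },
        { "data op 1.N",       ATTR_SIMPLE,  0 },
        { "SYNCHRONIZE CACHE", ATTR_ORDERED, 0 },  /* ordered cache sync */
        { "ordered data op 1", ATTR_ORDERED, 1 },  /* ordered FUA write  */
        { "data op 2.1",       ATTR_SIMPLE,  0 },
        { "data op 2.M",       ATTR_SIMPLE,  0 },
    };

    for (unsigned i = 0; i < sizeof(stream) / sizeof(stream[0]); i++)
        queue_cmd(&stream[i]);
    return 0;
}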


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-23 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:

I am not quite whether I should ask this question here, but in terms
of light weight barrier/fsync, could anyone tell me why the device
driver / OS provide the barrier interface other than some other
abstractions anyway? I am sorry if this sounds like a stupid questions
or it has been discussed before

I mean, most of the time, we only need some ordering in writes; not
complete order, but partial,very simple topological order. And a
barrier seems to be a heavy weighted solution to achieve this anyway:
you have to finish all writes before the barrier, then start all
writes issued after the barrier. That is some ordering which is much
stronger than what we need, isn't it?

As most of the time the order we need do not involve too many blocks
(certainly a lot less than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I image it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally. Particularly, it seems not
too hard to be implemented on top of SCSI's ordered/simple task mode?


Yes, SCSI has full support for ordered/simple commands, designed exactly for that 
task: to have a steady flow of commands even when some of them are ordered. It also 
has the necessary facilities to handle command errors without unexpected reordering 
of subsequent commands (ACA, etc.). Those allow getting full storage performance by 
fully "filling the pipe", in networking terms. I can easily imagine real-life 
configs where it can bring 2+ times more performance than queue flushing.


In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally 
have the link as the point of serialization, so all you need in a multithreaded 
environment is to carry a sequence number (SN) from the point where each ORDERED 
command is created to the point where it is sent to the link, and to make sure that 
no SIMPLE commands can ever cross ORDERED commands. You can see how it is 
implemented in SCST in an elegant and lockless manner (for SIMPLE commands).
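
A toy model of that serialization idea might look like the sketch below (my 
illustration only, not SCST's actual code; for simplicity it handles a single 
outstanding ORDERED command):

#include <stdio.h>

struct link_state {
    unsigned long next_sn;      /* next SN to hand out (starts at 1)        */
    unsigned long sent;         /* how many commands already hit the link   */
    unsigned long ordered_sn;   /* SN of the pending ORDERED cmd, 0 = none  */
};

static unsigned long new_cmd(struct link_state *ls, int ordered)
{
    unsigned long sn = ls->next_sn++;
    if (ordered)
        ls->ordered_sn = sn;
    return sn;
}

static int may_send(const struct link_state *ls, unsigned long sn, int ordered)
{
    if (ordered)
        return ls->sent == sn - 1;  /* every older command is already out */
    if (ls->ordered_sn && sn > ls->ordered_sn && ls->sent < ls->ordered_sn)
        return 0;                   /* SIMPLE must not pass the pending ORDERED */
    return 1;                       /* SIMPLE commands may pass each other */
}

int main(void)
{
    struct link_state ls = { 1, 0, 0 };
    unsigned long a = new_cmd(&ls, 0);   /* SIMPLE  */
    unsigned long b = new_cmd(&ls, 1);   /* ORDERED */
    unsigned long c = new_cmd(&ls, 0);   /* SIMPLE  */

    printf("send ORDERED %lu now? %d\n", b, may_send(&ls, b, 1)); /* 0: a not out  */
    printf("send SIMPLE  %lu now? %d\n", c, may_send(&ls, c, 0)); /* 0: behind b   */
    printf("send SIMPLE  %lu now? %d\n", a, may_send(&ls, a, 0)); /* 1: may go now */
    return 0;
}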


But historically, for some reason, Linux storage developers were stuck with the 
"barriers" concept, which is obviously not the same as ORDERED commands, and hence 
had a lot of trouble with its ambiguous semantics. As far as I can tell, the reason 
was a lack of sufficiently deep SCSI understanding (how to handle errors, the 
belief that ACA is something legacy from parallel SCSI times, etc.).


Hopefully, the storage developers will eventually realize the value behind ordered 
commands and learn the corresponding SCSI facilities for dealing with them. It's 
quite easy to demonstrate this value if you know where to look and don't blindly 
refuse the possibility. I have already tried to explain it a couple of times, but 
was not successful.


Before that happens, people will keep coming back again and again with the same 
simple questions: why must the queue be flushed for any ordered operation? Isn't 
that obvious overkill?


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/6] target/file: Re-enable optional fd_buffered_io=1 operation

2012-10-02 Thread Vladislav Bolkhovitin

Christoph Hellwig, on 10/01/2012 04:46 AM wrote:

On Sun, Sep 30, 2012 at 05:58:11AM +, Nicholas A. Bellinger wrote:

From: Nicholas Bellinger

This patch re-adds the ability to optionally run in buffered FILEIO mode
(eg: w/o O_DSYNC) for device backends in order to once again use the
Linux buffered cache as a write-back storage mechanism.

This difference with this patch is that fd_create_virtdevice() now
forces the explicit setting of emulate_write_cache=1 when buffered FILEIO
operation has been enabled.


What this lacks is a clear reason why you would enable this inherently
unsafe mode.  While there is some clear precedence to allow people doing
stupid thing I'd least like a rationale for it, and it being documented
as unsafe.


Nowadays nearly all serious applications are transactional and know how to flush 
the storage cache between transactions. That means that write-back caching is 
absolutely safe for them. No data can be lost under any circumstances.
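
As a minimal sketch of what "transactional and knows how to flush" means in 
practice (plain POSIX calls; the file name and helper are invented for 
illustration):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* The application treats a transaction as committed only after an explicit
 * cache flush (fdatasync() here) has returned, so data sitting in a
 * write-back cache before that point can be lost without harm. */
static int commit_transaction(int fd, const char *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fdatasync(fd);       /* the commit point */
}

int main(void)
{
    const char payload[] = "transaction #1 payload";
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);

    if (fd < 0) { perror("open"); return 1; }
    if (commit_transaction(fd, payload, sizeof(payload), 0) < 0)
        perror("commit_transaction");
    close(fd);
    return 0;
}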


Welcome to the 21st century, Christoph!

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: very poor ext3 write performance on big filesystems?

2008-02-19 Thread Vladislav Bolkhovitin

Tomasz Chmielewski wrote:

I have a 1.2 TB (of which 750 GB is used) filesystem which holds
almost 200 millions of files.
1.2 TB doesn't make this filesystem that big, but 200 millions of files 
is a decent number.



Most of the files are hardlinked multiple times, some of them are
hardlinked thousands of times.


Recently I began removing some of unneeded files (or hardlinks) and to 
my surprise, it takes longer than I initially expected.



After cache is emptied (echo 3 > /proc/sys/vm/drop_caches) I can usually 
remove about 5-20 files with moderate performance. I see up to 
5000 kB read/write from/to the disk, wa reported by top is usually 20-70%.



After that, waiting for IO grows to 99%, and disk write speed is down to 
50 kB/s - 200 kB/s (fifty - two hundred kilobytes/s).



Is it normal to expect the write speed go down to only few dozens of 
kilobytes/s? Is it because of that many seeks? Can it be somehow 
optimized? The machine has loads of free memory, perhaps it could be 
uses better?



Also, writing big files is very slow - it takes more than 4 minutes to 
write and sync a 655 MB file (so, a little bit more than 1 MB/s) - 
fragmentation perhaps?


It would be really interesting if you tried your workload with XFS. In my 
experience, XFS considerably outperforms ext3 on big (more than a few hundred MB) 
disks.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-11 Thread Vladislav Bolkhovitin

Luben Tuikov wrote:

Is there an open iSCSI Target implementation which does NOT issue commands to 
sub-target devices via the SCSI mid-layer, but bypasses it completely?


What do you mean? To call directly low level backstorage SCSI drivers 
queuecommand() routine? What are advantages of it?


Yes, that's what I meant.  Just curious.


What's the advantage of it?


Thanks,
   Luben

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:


Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

  Luben




Hi Luben,

I am guessing you mean futher down the stack, which I don't know this to
be the case.  Going futher up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout.  The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :).  I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects.  Could someone be so kind to fill me in on this..?


SCST uses the scsi_execute_async_fifo() function to submit commands to SCSI 
devices in pass-through mode. This function is a slightly modified version of 
scsi_execute_async() which submits requests in FIFO order instead of LIFO, as 
scsi_execute_async() does (so with scsi_execute_async() they are executed in the 
reverse order). scsi_execute_async_fifo() is added to the kernel as a separate 
patch.
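
As a generic illustration of the FIFO vs. LIFO difference just described (a toy 
list only, not the kernel code or the actual patch): inserting each new request at 
the head of a queue makes the consumer see them in reverse (LIFO) order, while 
inserting at the tail preserves submission (FIFO) order.

#include <stdio.h>

struct req { int id; struct req *next; };

static void add_head(struct req **q, struct req *r)
{
    r->next = *q;               /* LIFO: newest request first */
    *q = r;
}

static void add_tail(struct req **q, struct req *r)
{
    r->next = NULL;             /* FIFO: keep submission order */
    while (*q)
        q = &(*q)->next;
    *q = r;
}

static void run_queue(const char *name, struct req *q)
{
    printf("%s:", name);
    for (; q; q = q->next)
        printf(" req%d", q->id);
    printf("\n");
}

int main(void)
{
    struct req a = {1}, b = {2}, c = {3};
    struct req x = {1}, y = {2}, z = {3};
    struct req *lifo = NULL, *fifo = NULL;

    add_head(&lifo, &a); add_head(&lifo, &b); add_head(&lifo, &c);
    add_tail(&fifo, &x); add_tail(&fifo, &y); add_tail(&fifo, &z);

    run_queue("LIFO (head insert)", lifo);   /* req3 req2 req1 */
    run_queue("FIFO (tail insert)", fifo);   /* req1 req2 req3 */
    return 0;
}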



Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented.  Userspace passthrough in
this context is an target engine I/O that is enforcing max_sector and
sector_size limitiations, and encodes/decodes target storage protocol
packets all out of view of userspace.  The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.

--nab

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?


I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST 
framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work over 
some standard SCSI target framework. Hence the choice gets narrower: SCST vs 
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) 
in the mainline, because of a lot of code duplication. Nicholas could decide 
to move to either existing framework (although, frankly, I don't think 
there's a possibility for in-kernel iSCSI target and user space SCSI target 
framework) and if he decide to go with SCST, I'll be glad to offer my help 
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The 
better one should win.


why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.




I don't think anyone is saying it should be.  It makes sense that the
more mature SCSI engines that have working code will be providing alot
of the foundation as we talk about options..


From comparing the designs of SCST and LIO-SE, we know that SCST has

supports very SCSI specific target mode hardware, including software
target mode forks of other kernel code.  This code for the target mode
pSCSI, FC and SAS control paths (more for the state machines, that CDB
emulation) that will most likely never need to be emulated on non SCSI
target engine.


...but it is required for SCSI, so it has to be there anyway.


SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER) while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
and data to talk to every potential device in the Linux storage stack on
the largest amount of hardware architectures possible.

Most of the iSCSI Initiators I know (including non Linux) do not rely on
heavy SCSI task management, and I think this would be a lower priority
item to get real SCSI specific recovery in the traditional iSCSI target
for users.  Espically things like SCSI target mode queue locking
(affectionally called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because CmdSN rules are doing this for us.


Sorry, that isn't correct. ACA provides the ability to lock the command queue 
in case of a CHECK CONDITION, so it allows keeping the command execution order 
in case of errors. CmdSN keeps the command execution order only in the success 
case; in case of an error, the next queued command will be executed immediately 
after the failed one, although the application might require all commands 
subsequent to the failed one to be aborted. Think about journaled file systems, 
for instance. ACA also allows retrying the failed command and then resuming the 
queue.
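
A toy model of the difference (my illustration, not a real SCSI implementation): 
with plain CmdSN ordering a failure lets the next queued command run anyway, while 
ACA freezes the queue at the failing command so that nothing younger executes 
until the initiator decides what to do.

#include <stdio.h>

struct cmd { const char *name; int will_fail; };

static void run_queue(const struct cmd *q, int n, int use_aca)
{
    for (int i = 0; i < n; i++) {
        if (q[i].will_fail) {
            printf("%s -> CHECK CONDITION\n", q[i].name);
            if (use_aca) {
                printf("ACA established: %d younger command(s) held\n", n - i - 1);
                return;         /* initiator may abort the rest or retry */
            }
            continue;           /* CmdSN only: next command runs immediately */
        }
        printf("%s -> GOOD\n", q[i].name);
    }
}

int main(void)
{
    const struct cmd q[] = {
        { "write journal record", 0 },
        { "write metadata block", 1 },   /* fails */
        { "write commit block",   0 },   /* must not run if the previous failed */
    };

    printf("-- CmdSN ordering only --\n");
    run_queue(q, 3, 0);
    printf("-- with ACA --\n");
    run_queue(q, 3, 1);
    return 0;
}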


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

[EMAIL PROTECTED] wrote:

On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote:


Bart Van Assche wrote:


- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?



I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated 
for SCST framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work 
over some standard SCSI target framework. Hence the choice gets 
narrower: SCST vs STGT. I don't think there's a way for a dedicated 
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code 
duplication. Nicholas could decide to move to either existing 
framework (although, frankly, I don't think there's a possibility for 
in-kernel iSCSI target and user space SCSI target framework) and if he 
decide to go with SCST, I'll be glad to offer my help and support and 
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better 
one should win.



why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


the most common use of this sort of thing that I would see is to load up 
a bunch of 1TB SATA drives in a commodity PC, run software RAID, and 
then export the resulting volume to other servers via iSCSI. not a 
'real' SCSI device in sight.


As far as how good a standard iSCSI is, at this point I don't think it 
really matters. There are too many devices and manufacturers out there 
that implement iSCSI as their storage protocol (from both sides, 
offering storage to other systems, and using external storage). 
Sometimes the best technology doesn't win, but Linux should be 
interoperable with as much as possible and be ready to support the 
winners and the loosers in technology options, for as long as anyone 
chooses to use the old equipment (after all, we support things like 
Arcnet networking, which lost to Ethernet many years ago)


David, your question surprises me a lot. Where did you get the idea that SCST 
supports only pass-through backstorage? Does the RAM disk, which Bart has been 
using for performance tests, look like a SCSI device?


SCST supports all backstorage types you can imagine and that the Linux kernel 
supports.



David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Luben Tuikov wrote:

Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?


What do you mean? Calling the low-level backstorage SCSI drivers'
queuecommand() routine directly? What would be the advantages of that?



   Luben

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this?


I tend to agree, with some important notes:

1. IET should be excluded from this list; iSCSI-SCST is IET updated for
the SCST framework, with a lot of bug fixes and improvements.


2. I think everybody will agree that a Linux iSCSI target should work on
top of some standard SCSI target framework. Hence the choice gets
narrower: SCST vs. STGT. I don't think there's a place for a dedicated
iSCSI target (i.e. PyX/LIO) in the mainline, because of the large amount
of code duplication. Nicholas could decide to move to either existing
framework (although, frankly, I don't see how an in-kernel iSCSI target
could sit on top of a user-space SCSI target framework), and if he
decides to go with SCST, I'll be glad to offer my help and support, and
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better one
should win.


why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


nod

I don't think anyone is saying it should be.  It makes sense that the
more mature SCSI engines that have working code will be providing a lot
of the foundation as we talk about options.


From comparing the designs of SCST and LIO-SE, we know that SCST
supports very SCSI-specific target mode hardware, including software
target mode forks of other kernel code.  This is code for the target
mode pSCSI, FC and SAS control paths (more for the state machines than
CDB emulation) that will most likely never need to be emulated in a
non-SCSI target engine.


...but it is required for SCSI, so it must be there anyway.


SCST has support for the most SCSI fabric protocols of the group
(although it is lacking iSER), while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
design goal of LIO-SE was to let every iSCSI initiator that sends SCSI
CDBs and data talk to every potential device in the Linux storage stack
on the largest number of hardware architectures possible.

Most of the iSCSI initiators I know (including non-Linux ones) do not
rely on heavy SCSI task management, and I think getting real
SCSI-specific recovery into the traditional iSCSI target would be a
lower priority item for users.  Especially things like SCSI target mode
queue locking (affectionately called Auto Contingent Allegiance) make no
sense for traditional iSCSI or iSER, because the CmdSN rules are doing
this for us.


Sorry, that isn't correct. ACA provides the possibility to lock the
command queue in case of a CHECK CONDITION, and so allows keeping the
command execution order in case of errors. CmdSN keeps the command
execution order only in the success case; in case of an error, the next
queued command will be executed immediately after the failed one,
although the application might require all commands queued after the
failed one to be aborted. Think about journaled file systems, for
instance. ACA also allows retrying the failed command and then resuming
the queue.
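
[Editor's note: a minimal, purely illustrative sketch of the ordering
difference described above. This is not SCST, LIO or iSCSI-SCST code;
all of the types and names below are invented for illustration.]

/* Illustrative only: how a command queue might behave with and without
 * an ACA-style freeze after a failed command. */
#include <stdbool.h>
#include <stddef.h>

struct cmd {
	int (*execute)(struct cmd *c);	/* 0 on GOOD, -1 on CHECK CONDITION */
	struct cmd *next;
};

struct cmd_queue {
	struct cmd *head;
	bool aca_supported;	/* target honours ACA */
	bool aca_active;	/* queue frozen after a failure */
};

/* Process queued commands in order.
 *
 * Without ACA: after a CHECK CONDITION the next queued command still
 * runs, even if the application needed it aborted first -- CmdSN only
 * guarantees ordered delivery of the commands, not what happens after
 * an error.
 *
 * With ACA: the queue freezes, so the initiator can abort or retry
 * what it needs to before anything else is executed. */
static void process_queue(struct cmd_queue *q)
{
	while (q->head && !q->aca_active) {
		struct cmd *c = q->head;

		q->head = c->next;
		if (c->execute(c) != 0 && q->aca_supported)
			q->aca_active = true;	/* frozen until CLEAR ACA */
	}
}

/* Called once the initiator has cleared the ACA condition (e.g. after
 * retrying the failed command): resume the remaining commands. */
static void clear_aca(struct cmd_queue *q)
{
	q->aca_active = false;
	process_queue(q);
}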


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:


Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

  Luben




Hi Luben,

I am guessing you mean further down the stack, which I don't know to be
the case.  Going further up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout.  The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :).  I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects.  Could someone be so kind to fill me in on this..?


SCST uses the scsi_execute_async_fifo() function to submit commands to
SCSI devices in pass-through mode. This function is a slightly modified
version of scsi_execute_async(), which submits requests in FIFO order
instead of the LIFO order that scsi_execute_async() uses (so with
scsi_execute_async() they are executed in the reverse order).
scsi_execute_async_fifo() is added as a separate patch to the kernel.
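
[Editor's note: the patch itself is not reproduced here. The sketch
below only illustrates the FIFO-vs-LIFO distinction being described,
using the standard <linux/list.h> helpers; the structure and function
names are invented for illustration.]

#include <linux/list.h>

struct pending_req {
	struct list_head entry;
	/* ... command payload ... */
};

/* LIFO-style queueing: each new request goes to the head of the list,
 * so a consumer taking requests from the head picks them up in reverse
 * submission order. */
static void queue_req_lifo(struct list_head *queue, struct pending_req *req)
{
	list_add(&req->entry, queue);
}

/* FIFO-style queueing: each new request goes to the tail, preserving
 * the order in which the commands were submitted -- the behaviour the
 * scsi_execute_async_fifo() variant is described as providing. */
static void queue_req_fifo(struct list_head *queue, struct pending_req *req)
{
	list_add_tail(&req->entry, queue);
}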



Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented.  Userspace passthrough in
this context is a target engine I/O path that enforces max_sector and
sector_size limitations, and encodes/decodes target storage protocol
packets all out of view of userspace.  The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.

--nab

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

Since the focus of this thread shifted somewhat in the last few
messages, I'll try to summarize what has been discussed so far:
- There were a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been explained why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have, among other things, shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this?


I tend to agree, with some important notes:

1. IET should be excluded from this list; iSCSI-SCST is IET updated for
the SCST framework, with a lot of bug fixes and improvements.


2. I think everybody will agree that a Linux iSCSI target should work on
top of some standard SCSI target framework. Hence the choice gets
narrower: SCST vs. STGT. I don't think there's a place for a dedicated
iSCSI target (i.e. PyX/LIO) in the mainline, because of the large amount
of code duplication. Nicholas could decide to move to either existing
framework (although, frankly, I don't see how an in-kernel iSCSI target
could sit on top of a user-space SCSI target framework), and if he
decides to go with SCST, I'll be glad to offer my help and support, and
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better one
should win.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Tue, 2008-02-05 at 21:59 +0300, Vladislav Bolkhovitin wrote:


Hmm, how can one write to an mmapped page and not touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, we agreed the mmap() approach is impractical, but could you
elaborate on this anyway, please? I'm just curious. Are you thinking
about implementing a new syscall, which would put pages with data into
the mmap'ed area?


No, it has to do with the way invalidation occurs.  When you mmap a
region from a device or file, the kernel places page translations for
that region into your vm_area.  The regions themselves aren't backed
until faulted.  For write (i.e. incoming command to target) you specify
the write flag and send the area off to receive the data.  The gather,
expecting the pages to be overwritten, backs them with pages marked
dirty but doesn't fault in the contents (unless it already exists in the
page cache).  The kernel writes the data to the pages and the dirty
pages go back to the user.  msync() flushes them to the device.

The disadvantage of all this is that the handle for the I/O if you will
is a virtual address in a user process that doesn't actually care to see
the data. non-x86 architectures will do flushes/invalidates on this
address space as the I/O occurs.
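
[Editor's note: a user-space sketch of the mmap()/msync() write path
being described, for readers unfamiliar with the sequence. The
receive_from_initiator() stub and handle_write() name are invented,
error handling is abbreviated, and the offset is assumed page-aligned;
the stub stands in for the step the kernel would perform on the
process's behalf.]

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t receive_from_initiator(void *buf, size_t len);	/* stub */

static int handle_write(int fd, off_t off, size_t len)
{
	/* Map the region of the backing store that the WRITE will cover
	 * (off is assumed page-aligned for brevity). */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
	int ret;

	if (p == MAP_FAILED)
		return -1;

	/* The incoming data lands directly in the mapped, page cache
	 * backed region. */
	if (receive_from_initiator(p, len) != (ssize_t)len) {
		munmap(p, len);
		return -1;
	}

	/* msync() pushes the dirty pages to the backing store. */
	ret = msync(p, len, MS_SYNC);
	munmap(p, len);
	return ret;
}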


I more or less see, thanks. But (1) the pages still need to be mmapped
into the user space process before the data transmission, i.e. they must
be zeroed before being mmapped, which isn't much faster than a data
copy, and (2) I suspect it would be hard to make it race free, e.g. if
another process wanted to write to the same area simultaneously.



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just shown that there is no good way to
implement zero-copy cached I/O for STGT. I see only one practical way to
do it, proposed by FUJITA Tomonori some time ago: duplicating the Linux
page cache in user space. But would you like that?


Well, there's no real evidence that zero copy or lack of it is a problem
yet.


The performance improvement from zero copy can be easily estimated,
knowing the link throughput and the data copy throughput, which are
about the same for 20 Gbps links (I did that a few e-mails ago).
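
[Editor's note: a back-of-the-envelope version of that estimate. It
assumes the extra data copy is serialized with the wire transfer and
that memcpy() bandwidth is roughly equal to the link bandwidth, as
stated above; real numbers obviously depend on the hardware.]

#include <stdio.h>

int main(void)
{
	double link_gbps = 20.0;	/* wire speed of the link */
	double copy_gbps = 20.0;	/* memcpy() bandwidth, assumed comparable */

	/* One-copy path: every byte crosses the wire and is also copied
	 * once, so the serialized throughput is the harmonic combination
	 * of the two rates. */
	double one_copy = 1.0 / (1.0 / link_gbps + 1.0 / copy_gbps);

	/* Zero-copy path is limited by the link alone. */
	double zero_copy = link_gbps;

	printf("one copy : %.1f Gbps\n", one_copy);	/* ~10 Gbps */
	printf("zero copy: %.1f Gbps\n", zero_copy);	/* ~20 Gbps */
	return 0;
}

Under these assumptions removing the copy roughly doubles the
achievable throughput, which is why the effect only shows up on links
fast enough to be comparable to memory copy speed.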


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Jeff Garzik wrote:
iSCSI is way, way too complicated. 


I fully agree. On the one hand, all that complexity is unavoidable for
the case of multiple connections per session, but for the regular case
of one connection per session it should be a lot simpler.


Actually, think about those multiple connections...  we already had to 
implement fast-failover (and load bal) SCSI multi-pathing at a higher 
level.  IMO that portion of the protocol is redundant:   You need the 
same capability elsewhere in the OS _anyway_, if you are to support 
multi-pathing.


I'm thinking about MC/S as a way to improve performance using several
physical links. There's no way other than MC/S to keep the command
processing order in that case. So it's a really valuable property of
iSCSI, although with limited application.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Erez Zilber wrote:

Bart Van Assche wrote:


As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
pieces of software are necessary to make networked storage possible:
initiator software and target software. As far as I know there exist
three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
InfiniBand network. See also the following thread for the details:
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com_name=scst-devel.

 


Sorry for the late response (but better late than never).

One may claim that STGT should have lower performance than SCST because
its data path is from userspace. However, your results show that for
non-IB transports, they both show the same numbers. Furthermore, with IB
there shouldn't be any additional difference between the 2 targets
because data transfer from userspace is as efficient as data transfer
from kernel space.


And now consider what happens if one target has zero-copy cached I/O.
How much will that improve its performance?



The only explanation that I see is that fine tuning for iSCSI & iSER is
required. As was already mentioned in this thread, with SDR you can get
~900 MB/sec with iSER (on STGT).

Erez

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Jeff Garzik wrote:

Alan Cox wrote:

better. So for example, I personally suspect that ATA-over-ethernet is way 
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
low-level, and against those crazy SCSI people to begin with.


Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.



AoE is truly a thing of beauty.  It has a two/three page RFC (say no more!).

But quite so...  AoE is limited to MTU size, which really hurts.  Can't 
really do tagged queueing, etc.



iSCSI is way, way too complicated. 


I fully agree. On the one hand, all that complexity is unavoidable for
the case of multiple connections per session, but for the regular case
of one connection per session it should be a lot simpler.


And now think about iSER, which brings iSCSI to a whole new complexity
level ;)


It's an Internet protocol designed 
by storage designers, what do you expect?


For years I have been hoping that someone will invent a simple protocol 
(w/ strong auth) that can transit ATA and SCSI commands and responses. 
Heck, it would be almost trivial if the kernel had a TLS/SSL implementation.


Jeff

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Linus Torvalds wrote:

I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem



.. not even shared. It was hard to get correct semantics full stop. 

Which is a traditional problem. The thing is, the kernel always has some 
internal state, and it's hard to expose all the semantics that the kernel 
knows about to user space.


So no, performance is not the only reason to move to kernel space. It can 
easily be things like needing direct access to internal data queues (for a 
iSCSI target, this could be things like barriers or just tagged commands - 
yes, you can probably emulate things like that without access to the 
actual IO queues, but are you sure the semantics will be entirely right?


The kernel/userland boundary is not just a performance boundary, it's an 
abstraction boundary too, and these kinds of protocols tend to break 
abstractions. NFS broke it by having "file handles" (which is not 
something that really exists in user space, and is almost impossible to 
emulate correctly), and I bet the same thing happens when emulating a SCSI 
target in user space.


Yes, there is something like that for SCSI target as well. It's a "local 
initiator" or "local nexus", see 
http://thread.gmane.org/gmane.linux.scsi/31288 and 
http://news.gmane.org/find-root.php?message_id=%3c463F36AC.3010207%40vlnb.net%3e 
for more info about that.


In fact, the existence of the local nexus is one more point in favor of
SCST over STGT, because for STGT it's pretty hard to support (all
locally generated commands would have to be passed through its daemon,
which would be a total disaster for performance), while for SCST it can
be done relatively simply.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Linus Torvalds wrote:
So just going by what has happened in the past, I'd assume that iSCSI 
would eventually turn into "connecting/authentication in user space" with 
"data transfers in kernel space".


This is exactly how iSCSI-SCST (the iSCSI target driver for SCST) is
implemented; credit to the IET and Ardis target developers.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Mon, 2008-02-04 at 21:38 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:


On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:



James Bottomley wrote:



On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:




James Bottomley wrote:



So, James, what is your opinion on the above? Or does the overall SCSI
target project's simplicity not matter much to you, and you think it's
fine to duplicate the Linux page cache in user space to keep the
in-kernel part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi=120164008302435

and here:

http://marc.info/?l=linux-scsi=120171067107293

Weren't they?


No, sorry, it doesn't look that way to me. Those are about performance,
but I'm asking about the overall project's architecture, namely about
one part of it: simplicity. In particular, what do you think about
duplicating the Linux page cache in user space to get zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
within STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel=120169189504361=2. This is one of 
the advantages of the kernel space implementation. The user space 
implementation has to have data copied between the cache and a user
space buffer, but the kernel space one can use pages in the cache
directly, without an extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, you actually have to do an msync on the region to get the
data secured to the backing store.


James, have you checked how fast mmapped I/O is if the working set is
larger than RAM? It's several times slower compared to buffered I/O.
This has been discussed many times on LKML and, it seems, the VM people
consider it unavoidable.
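
[Editor's note: a crude way to reproduce that comparison on a file
larger than RAM; the program, mode names and chunk size are invented
for illustration and this is not the benchmark referred to above. Time
each mode separately, dropping caches in between.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (1 << 20)	/* 1 MB, arbitrary */

int main(int argc, char **argv)
{
	int fd;
	struct stat st;
	unsigned long sum = 0;

	if (argc != 3) {
		fprintf(stderr, "usage: %s read|mmap <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[2], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[2]);
		return 1;
	}

	if (strcmp(argv[1], "read") == 0) {
		/* Buffered path: read() copies the data out of the page
		 * cache; touching one byte keeps the loop from being
		 * optimized away. */
		static char buf[CHUNK];
		ssize_t n;

		while ((n = read(fd, buf, CHUNK)) > 0)
			sum += (unsigned char)buf[0];
	} else {
		/* mmap path: map the whole file (assumes a 64-bit box)
		 * and fault in every page by touching one byte of it. */
		long pg = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		off_t i;

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		for (i = 0; i < st.st_size; i += pg)
			sum += (unsigned char)p[i];
		munmap(p, st.st_size);
	}
	printf("%lu\n", sum);
	close(fd);
	return 0;
}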



Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.


James, just check and you will see that buffered I/O is a lot faster.


So in an out of memory situation the buffers you don't have are a lot
faster than the pages I don't have?


There isn't an OOM condition in either case. It's just that page
reclamation and readahead work much better in the buffered case.


So, using mmapped I/O isn't an option for high performance. Also,
mmapped I/O isn't an option where there are high reliability
requirements, since it doesn't provide a practical way to handle I/O
errors.


I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region. 


Err, returned to whom? If you try to read from an mmapped page which
can't be populated due to an I/O error, you will get SIGBUS or SIGSEGV,
I don't remember exactly which. It's quite tricky to get back to the
faulted command from the signal handler.


Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you
think that such mapping/unmapping is good for performance?
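
[Editor's note: a sketch of the signal-handler gymnastics being
referred to -- recovering from an I/O error on a mapped page via SIGBUS
and sigsetjmp(), so the faulted command can be failed cleanly. The
copy_from_mapping() name is invented; this is not code from any of the
targets discussed, and it is not thread-safe as written.]

#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf fault_env;

static void sigbus_handler(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/* Copy 'len' bytes out of a mapped file region; returns -1 if the
 * backing store reported an I/O error for any page in the range. */
static int copy_from_mapping(void *dst, const void *src, size_t len)
{
	struct sigaction sa = { 0 }, old;
	int ret = 0;

	sa.sa_handler = sigbus_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(fault_env, 1) == 0)
		memcpy(dst, src, len);	/* may raise SIGBUS on an I/O error */
	else
		ret = -1;		/* back here; fail the command cleanly */

	sigaction(SIGBUS, &old, NULL);
	return ret;
}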




msync does something
similar if there's a write failure.



You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more specific about what kind of tricks need to be done for
that?


Actually, just avoiding touching it seems to do the trick with a recent
kernel.


Hmm, how can one write to an mmapped page and not touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, we agreed the mmap() approach is impractical, but could you
elaborate on this anyway, please? I'm just curious. Are you thinking
about implementing a new syscall, which would put pages with data into
the mmap'ed area?



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just shown that there is no good way to
implement zero-copy cached I/O for STGT. I see only one practical way to
do it, proposed by FUJITA Tomonori some time ago: duplicating the Linux
page cache in user space. But would you like that?



There's no actual evidence that copy problems are causing any
performance issues for STGT.  In fact, there's evidence that they're
not, for everything except IB networks.


Zero-copy cached I/O has not yet been implemented in SCST; I simply
haven't had time for that so far. Currently SCST performs better than
STGT because of a simpler processing path and fewer context switches per
command. Memcpy() speed on modern systems is about th
