[ANNOUNCE]: SCST 3.3 pre-release freeze

2017-08-31 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.3 pre-release code freeze in the SCST SVN branch 
3.3.x.

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.3.x

It is going to be released after a few weeks of testing, if no significant issues are found.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. Many recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to Bart Van Assche and 
SanDisk/Western Digital!

Vlad




[ANNOUNCE]: SCST 3.2 released

2016-12-15 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce that SCST 3.2 has just been released.

You can download it from http://scst.sourceforge.net/downloads.html

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. Many modern SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk/WDC for the great 
support!

Vlad




[ANNOUNCE]: SCST 3.2 pre-release freeze

2016-08-02 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.2 pre-release code freeze in the SCST SVN branch 
3.2.x.

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.2.x

It is going to be released after a few weeks of testing, if no significant issues are found.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk/WDC for the great 
support!

Vlad




[ANNOUNCE]: SCST 3.1 release

2016-01-21 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce that SCST version 3.1 has just been released and is available for download from http://scst.sourceforge.net/downloads.html.

Highlights for this release:

 - Cluster support for SCSI reservations. This feature is essential for 
initiator-side
clustering approaches based on persistent reservations, e.g. the quorum disk
implementation in Windows Clustering.

 - Full support for VAAI, or vStorage API for Array Integration: Extended Copy command support has been added, and the performance of WRITE SAME and of Atomic Test & Set, also known as COMPARE AND WRITE, has been improved.

 - T10-PI support has been added.

 - ALUA support has been improved: explicit ALUA (SET TARGET PORT GROUPS 
command) has
been added and DRBD compatibility has been improved.

 - SCST events user space infrastructure has been added, so now SCST can notify 
a user
space agent about important internal and fabric events.

 - QLogic target driver has been significantly improved.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk for the great support! Development of all the above highlights was supported by SanDisk.

Vlad




Re: [ANNOUNCE]: SCST 3.1 pre-release freeze

2015-11-06 Thread Vladislav Bolkhovitin
Hi,

Bike & Snow wrote on 11/06/2015 10:55 AM:
> Hello Vlad
> 
> Excellent news on all the updates.
> 
> Regarding this:
> - QLogic target driver has been significantly improved.
> 
> Does that mean I should stop building the QLogic target driver from here?
> git://git.qlogic.com/scst-qla2xxx.git 
> 
> Or are you saying the git.qlogic.com  has been 
> improved?

It is saying that qla2x00t was improved.

The ultimate goal is for the mainstream (git) QLogic target driver to become the main and only QLogic target driver, but, unfortunately, this driver has not yet reached the level of quality and maturity of qla2x00t. We are working with QLogic toward that.

> If I stop building the one from git.qlogic.com , does 
> the 3.2.0
> one support NPIV?

Yes, it has full NPIV support.

Vlad




[ANNOUNCE]: SCST 3.1 pre-release freeze

2015-11-05 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.1 pre-release code freeze in the SCST SVN branch 3.1.x.

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.1.x

It is going to be released after a few weeks of testing, if no significant issues are found.

Highlights for this release:

 - Cluster support for SCSI reservations. This feature is essential for 
initiator-side
clustering approaches based on persistent reservations, e.g. the quorum disk
implementation in Windows Clustering.

 - Full support for VAAI, or vStorage API for Array Integration: Extended Copy command support has been added, and the performance of WRITE SAME and of Atomic Test & Set, also known as COMPARE AND WRITE, has been improved.

 - T10-PI support has been added.

 - ALUA support has been improved: explicit ALUA (SET TARGET PORT GROUPS 
command) has
been added and DRBD compatibility has been improved.

 - SCST events user space infrastructure has been added, so now SCST can notify 
a user
space agent about important internal and fabric events.

 - QLogic target driver has been significantly improved.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which can provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk for the great support! Development of all the above highlights was supported by SanDisk.

Vlad



[ANNOUNCE]: SCST 3.0.1 released

2015-02-24 Thread Vladislav Bolkhovitin
I'm glad to announce that the maintenance update 3.0.1 for SCST and its drivers has just been released and is ready for download from http://scst.sourceforge.net/downloads.html. All SCST users are encouraged to update.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen, especially Bart Van Assche!

Vlad



Re: [ANNOUNCE]: SCST 3.0 released

2014-09-21 Thread Vladislav Bolkhovitin
No, because it's too new, but you can always get it from git. Or you can use the stable Emulex driver for 16Gb connectivity. It's not in the bundle only because of Emulex policy.


Thanks,
Vlad

On 9/19/2014 23:59, scst.n...@gmail.com wrote:

Is 16Gb qla2x00t included?

Sent from my Mi phone

Vladislav Bolkhovitin wrote on 2014-9-20 at 2:39 PM:

Hi All,

I'm glad to announce that SCST 3.0 has just been released. This
release includes SCST
core, target drivers iSCSI-SCST for iSCSI, including iSER support
(thanks to
Mellanox!), qla2x00t for QLogic Fibre Channel adapters, ib_srpt for
InfiniBand SRP,
fcst for FCoE and scst_local for local loopback-like access as well
as SCST management
utility scstadmin. Also separately you can download from Emulex
development portal
stable and fully functional target driver for the current generation
of Emulex Fibre
Channel adapters.

SCST is alternative SCSI target stack for Linux. SCST allows
creation of sophisticated
storage devices, which provide advanced functionality, like
replication, thin
provisioning, deduplication, high availability, automatic backup,
etc. Majority of
recently developed SAN appliances, especially higher end ones, are
SCST based. It might
well be that your favorite storage appliance running SCST in the
firmware.

More info about SCST and its modules you can find on:
http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad




[ANNOUNCE]: SCST 3.0 released

2014-09-20 Thread Vladislav Bolkhovitin

Hi All,

I'm glad to announce that SCST 3.0 has just been released. This release includes the SCST core, the target drivers iSCSI-SCST for iSCSI, including iSER support (thanks to Mellanox!), qla2x00t for QLogic Fibre Channel adapters, ib_srpt for InfiniBand SRP, fcst for FCoE and scst_local for local loopback-like access, as well as the SCST management utility scstadmin. A stable and fully functional target driver for the current generation of Emulex Fibre Channel adapters can also be downloaded separately from the Emulex development portal.
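
For anyone who wants to try the release, a minimal scstadmin configuration
(/etc/scst.conf) that exports one file-backed virtual disk over iSCSI might look
roughly like the sketch below; the device name, backing file path and target IQN
are just placeholders, not anything shipped with SCST:

HANDLER vdisk_fileio {
        DEVICE disk01 {
                filename /var/lib/scst/disk01.img
        }
}

TARGET_DRIVER iscsi {
        enabled 1

        TARGET iqn.2014-09.net.example:tgt1 {
                enabled 1
                LUN 0 disk01
        }
}

Such a file is typically applied with "scstadmin -config /etc/scst.conf" once the
SCST modules are loaded.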


SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad



[ANNOUNCE]: SCST 3.0 pre-release freeze

2014-05-21 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce SCST 3.0 pre-release code freeze in the SCST SVN branch 
3.0.x

You can get it by command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.0.x

It is going to be released after a few weeks of testing, if nothing bad is found.

SCST is an alternative SCSI target stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad



[ANNOUNCE]: SCST iSER target driver is available for testing

2014-01-29 Thread Vladislav Bolkhovitin
I'm glad to announce that the SCST iSER target driver is available for testing from the SCST SVN iser branch. You can download it either with the command:

$ svn checkout svn://svn.code.sf.net/p/scst/svn/branches/iser iser-scst-branch

or by clicking on "Download Snapshot" button on 
http://sourceforge.net/p/scst/svn/HEAD/tree/branches/iser page.

Big thanks to Yan Burman and Mellanox Technologies who developed it!

SCST is a SCSI target mode stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Vlad




Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-28 Thread Vladislav Bolkhovitin

Zuckerman, Boris, on 09/26/2013 12:36 PM wrote:
> I assume that we may have both: CPUs that may have ability to support 
> multiple transactions, CPUs that support only one, CPUs that support none (as 
> today), as well as different devices - transaction capable and not.
> So, it seems there is a room for compilers to do their work and for class 
> drivers to do their, right?

Yes, correct.

Conceptually, NVDIMMs are not block devices. They may be used as block devices, but they may also not be. So, nailing them into the block abstraction with a big hammer is simply bad design.

Vlad

> boris
> 
>> -Original Message-
>> From: Matthew Wilcox [mailto:wi...@linux.intel.com]
>> Sent: Thursday, September 26, 2013 1:56 PM
>> To: Zuckerman, Boris
>> Cc: Vladislav Bolkhovitin; rob.gitt...@linux.intel.com; 
>> linux-p...@lists.infradead.org;
>> linux-fsde...@veger.org; linux-kernel@vger.kernel.org
>> Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs
>>
>> On Thu, Sep 26, 2013 at 02:56:17PM +, Zuckerman, Boris wrote:
>>> To work with persistent memory as efficiently as we can work with RAM we 
>>> need a
>> bit more than "commit". It's reasonable to expect that we get some additional
>> support from CPUs that goes beyond mfence and mflush. That may include 
>> discovery,
>> transactional support, etc. Encapsulating that in a special class sooner 
>> than later
>> seams a right thing to do...
>>
>> If it's something CPU-specific, then we wouldn't handle it as part of the 
>> "class", we'd
>> handle it as an architecture abstraction.  It's only operations which are 
>> device-specific
>> which would need to be exposed through an operations vector.  For example, 
>> suppose
>> you buy one device from IBM and another device from HP, and plug them both 
>> into
>> your SPARC system.  The code you compile needs to run on SPARC, doing 
>> whatever
>> CPU operations are supported, but if HP and IBM have different ways of 
>> handling a
>> "commit" operation, we need that operation to be part of an operations 
>> vector.




Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-26 Thread Vladislav Bolkhovitin
Hi Rob,

Rob Gittins, on 09/23/2013 03:51 PM wrote:
> On Fri, 2013-09-06 at 22:12 -0700, Vladislav Bolkhovitin wrote:
>> Rob Gittins, on 09/04/2013 02:54 PM wrote:
>>> Non-volatile DIMMs have started to become available.  A NVDIMMs is a
>>> DIMM that does not lose data across power interruptions.  Some of the
>>> NVDIMMs act like memory, while others are more like a block device
>>> on the memory bus. Application uses vary from being used to cache
>>> critical data, to being a boot device.
>>>
>>> There are two access classes of NVDIMMs,  block mode and
>>> “load/store” mode DIMMs which are referred to as Direct Memory
>>> Mappable.
>>>
>>> The block mode is where the DIMM provides IO ports for read or write
>>> of data.  These DIMMs reside on the memory bus but do not appear in the
>>> application address space.  Block mode DIMMs do not require any changes
>>> to the current infrastructure, since they provide IO type of interface.
>>>
>>> Direct Memory Mappable DIMMs (DMMD) appear in the system address space
>>> and are accessed via load and store instructions.  These NVDIMMs
>>> are part of the system physical address space (SPA) as memory with
>>> the attribute that data survives a power interruption.  As such this
>>> memory is managed by the kernel which can  assign virtual addresses and
>>> mapped into application’s address space as well as being accessible
>>> by the kernel.  The area mapped into the system address space is
>>> being referred to as persistent memory (PMEM).
>>>
>>> PMEM introduces the need for new operations in the
>>> block_device_operations to support the specific characteristics of
>>> the media.
>>>
>>> First data may not propagate all the way through the memory pipeline
>>> when store instructions are executed.  Data may stay in the CPU cache
>>> or in other buffers in the processor and memory complex.  In order to
>>> ensure the durability of data there needs to be a driver entry point
>>> to force a byte range out to media.  The methods of doing this are
>>> specific to the PMEM technology and need to be handled by the driver
>>> that is supporting the DMMDs.  To provide a way to ensure that data is
>>> durable adding a commit function to the block_device_operations vector.
>>>
>>>void (*commitpmem)(struct block_device *bdev, void *addr);
>>
>> Why to glue to the block concept for apparently not block class of devices? 
>> By pushing
>> NVDIMMs into the block model you both limiting them to block devices 
>> capabilities as
>> well as have to expand block devices by alien to them properties
> Hi Vlad,
> 
> We chose to extent the block operations for a couple of reasons.  The
> majority of NVDIMM usage is by emulating block mode.  We figure that
> over time usages will appear that use them directly and then we can
> design interfaces to enable direct use.  
> 
> Since a range of NVDIMM needs a name, security and other attributes mmap
> is a really good model to build on.  This quickly takes us into the
> realm of a file systems, which are easiest to build on the existing
> block infrastructure.  
> 
> Another reason to extend block is that all of the existing
> administrative interfaces and tools such as mkfs still work and we have
> not added some new management tools and requirements that may inhibit
> the adoption of the technology.  Basically if it works today for block
> the same cli commands will work for NVDIMMs.
> 
> The extensions are so minimal that they don't negatively impact the
> existing interfaces.

Well, they will negatively impact them, because those NVDIMM additions are conceptually alien to the block device concept.

You didn't answer: why not create a new class of devices for NVDIMMs and implement a one-fits-all block driver for them? That is a simple, clean and elegant solution, which will fit your need to have a block device on top of an NVDIMM device pretty well, with minimal effort.
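
Just to sketch what such a class could look like (every name below is hypothetical
and not existing kernel API), the per-device operations of a dedicated NVDIMM class
might be as small as:

#include <linux/types.h>

struct nvdimm_dev;      /* hypothetical per-device object of the new class */

/* Hypothetical ops vector an NVDIMM driver would register with the class. */
struct nvdimm_class_ops {
        /* Size of the persistent region in bytes. */
        u64 (*get_size)(struct nvdimm_dev *dev);

        /* Map [offset, offset + len) of the region into kernel space. */
        void *(*map)(struct nvdimm_dev *dev, u64 offset, size_t len);

        /* Make a previously written range durable on the media. */
        int (*commit)(struct nvdimm_dev *dev, void *addr, size_t len);
};

A single generic block driver could then be written once against this vector to
expose any registered NVDIMM as a block device, while load/store users talk to the
class directly and bypass the block layer.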

Vlad



Re: RFC Block Layer Extensions to Support NV-DIMMs

2013-09-06 Thread Vladislav Bolkhovitin

Rob Gittins, on 09/04/2013 02:54 PM wrote:
> Non-volatile DIMMs have started to become available.  A NVDIMMs is a
> DIMM that does not lose data across power interruptions.  Some of the
> NVDIMMs act like memory, while others are more like a block device
> on the memory bus. Application uses vary from being used to cache
> critical data, to being a boot device.
> 
> There are two access classes of NVDIMMs,  block mode and
> “load/store” mode DIMMs which are referred to as Direct Memory
> Mappable.
> 
> The block mode is where the DIMM provides IO ports for read or write
> of data.  These DIMMs reside on the memory bus but do not appear in the
> application address space.  Block mode DIMMs do not require any changes
> to the current infrastructure, since they provide IO type of interface.
> 
> Direct Memory Mappable DIMMs (DMMD) appear in the system address space
> and are accessed via load and store instructions.  These NVDIMMs
> are part of the system physical address space (SPA) as memory with
> the attribute that data survives a power interruption.  As such this
> memory is managed by the kernel which can  assign virtual addresses and
> mapped into application’s address space as well as being accessible
> by the kernel.  The area mapped into the system address space is
> being referred to as persistent memory (PMEM).
> 
> PMEM introduces the need for new operations in the
> block_device_operations to support the specific characteristics of
> the media.
> 
> First data may not propagate all the way through the memory pipeline
> when store instructions are executed.  Data may stay in the CPU cache
> or in other buffers in the processor and memory complex.  In order to
> ensure the durability of data there needs to be a driver entry point
> to force a byte range out to media.  The methods of doing this are
> specific to the PMEM technology and need to be handled by the driver
> that is supporting the DMMDs.  To provide a way to ensure that data is
> durable adding a commit function to the block_device_operations vector.
> 
>void (*commitpmem)(struct block_device *bdev, void *addr);

Why glue to the block concept for what is apparently not a block class of devices? By pushing NVDIMMs into the block model you are both limiting them to block device capabilities and having to extend block devices with properties that are alien to them.

NVDIMMs are, apparently, a new class of devices, so it is better to have a new class of kernel devices for them. If you then need to put file systems on top of them, just write a one-fits-all blk_nvmem driver, which can create a block device for all types of NVDIMM devices and drivers.

This way you will clearly and gracefully get the best from NVDIMM devices and won't soil block devices.
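
As a rough illustration of the blk_nvmem idea (the structure and helper below are
hypothetical, and the actual block-layer registration is omitted), the whole data
path of such a shim comes down to a memcpy into the mapped persistent region plus
the device's commit callback:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/errno.h>

/* Hypothetical handle an NVDIMM driver would hand to blk_nvmem. */
struct nvmem_region {
        void *base;     /* persistent region mapped into kernel space */
        u64 size;       /* region size in bytes */
        /* Make a written range durable; supplied by the NVDIMM driver. */
        int (*commit)(struct nvmem_region *r, void *addr, size_t len);
};

/* Serve one block-device read or write against the region. */
static int blk_nvmem_rw(struct nvmem_region *r, void *buf, u64 offset,
                        size_t len, bool is_write)
{
        if (offset + len > r->size)
                return -EINVAL;

        if (is_write) {
                memcpy(r->base + offset, buf, len);
                /* Force the just-written range out to the media. */
                return r->commit(r, r->base + offset, len);
        }

        memcpy(buf, r->base + offset, len);
        return 0;
}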

Vlad



[ANNOUNCE]: Emulex SCST support for 16Gb/s FC and FCoE CNAs

2013-09-04 Thread Vladislav Bolkhovitin
I'm glad to announce that SCST support for 16Gb/s FC and FCoE Emulex CNAs is now
available as part of the Emulex OneCore Storage SDK tool set based on the 
Emulex SLI-4
API. Support for 16Gb/s Fibre Channel LPe16000 series and FCoE hardware using 
target
mode versions of the OneConnect FCoE CNAs is included. Documented for use with
RHEL/CentOS 6.x based distributions, ocs_fc_scst works with the stable SCST 
2.2.1 as
well as the development versions of 2.2.x and 3.0.x. The driver code and 
documentation
are available on the Emulex web site at:
http://www.emulex.com/products/onecore-storage-software-development-kit/overview.html

Registration is required on the Developer Portal, but this is free.

Questions regarding this driver are better asked via the Developer Portal.

SCST is a SCSI target mode stack for Linux. SCST allows creation of sophisticated storage devices, which provide advanced functionality, like replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are SCST based. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at: http://scst.sourceforge.net

Vlad



Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))

2013-05-29 Thread Vladislav Bolkhovitin
Martin K. Petersen, on 05/28/2013 01:25 PM wrote:
> Vladislav> Linux block layer is purely artificial creature slowly
> Vladislav> reinventing wheel creating more problems, than solving.
> 
> On the contrary. I do think we solve a whole bunch of problems.
> 
> 
> Vladislav> It enforces approach, where often "impossible" means
> Vladislav> "impossible in this interface".
> 
> I agree we have limitations. I do not agree that all limitations are
> bad. Sometimes it's OK to say no.
> 
> 
> Vladislav> For instance, how about copy offload?  How about atomic
> Vladislav> writes?
> 
> I'm actively working on copy offload. Nobody appears to be interested in
> atomic writes. Otherwise I'd work on those as well.
> 
> 
> Vladislav> Why was it needed to create special blk integrity interface
> Vladislav> with the only end user - SCSI?
> 
> Simple. Because we did not want to interleave data and PI 512+8+512+8
> neither in memory, nor at DMA time.

It can similarly be done in a SCSI-like interface without the need for any middleman.

> Furthermore, the ATA EPP proposal
> was still on the table so I also needed to support ATA.
> 
> And finally, NVM Express uses the blk_integrity interface as well.
> 
> 
> Vladislav> The block layer keeps repeating SCSI. So, maybe, after all,
> Vladislav> it's better to acknowledge that direct usage of SCSI without
> Vladislav> any intermediate layers and translations is more productive?
> Vladislav> And for those minors not using SCSI internally, translate
> Vladislav> from SCSI to their internal commands? Creating and filling
> Vladislav> CDB fields for most cases isn't anyhow harder, than creating
> Vladislav> and feeling bio fields.
> 
> This is quite possibly the worst idea I have heard all week.
> 
> As it stands it's a headache for the disk ULD driver to figure out which
> of the bazillion READ/WRITE variants to send to a SCSI/ATA device. What
> makes you think that an application or filesystem would be better
> equipped to make that call?
> 
> See also: WRITE SAME w/ zeroes vs. WRITE SAME w/ UNMAP vs. UNMAP 
> 
> See also: EXTENDED COPY vs. the PROXY command set
> 
> See also: USB-ATA bridge chips
> 
> You make it sound like all the block layer does is filling out
> CDBs. Which it doesn't in fact have anything to do with at all.
> 
> When you are talking about CDBs we're down in the SBC/SSC territory.
> Which is such a tiny bit of what's going on. We have transports, we have
> SAM, we have HBA controller DMA constraints, system DMA constraints,
> buffer bouncing, etc. There's a ton of stuff that needs to happen before
> the CDB and the data physically reach the storage.
> 
> You seem to be advocating that everything up to the point where the
> device receives the command is in the way. Well, by all means. Why limit
> ourselves to the confines of SCSI? Why not get rid of POSIX
> read()/write(), page cache, filesystems and let applications speak
> ST-506 directly?
> 
> I know we're doing different things. My job is to make a general purpose
> operating system with interfaces that make sense to normal applications.
> That does not preclude special cases where it may make sense to poke at
> the device directly. For testing purposes, for instance. But I consider
> it a failure when we start having applications that know about hardware
> intricacies, cylinders/heads/sectors, etc. That road leads straight to
> the 1980s...

What you mean is true, but my point is that this abstraction is better done in a SCSI, i.e. SAM, manner. No need to write fields inside of CDBs, that would be pretty inconvenient ;). But CDB fields can be fields in some scsi_io structure. Exact opcodes can easily be abstracted and filled in at the last stage, where the final CDB is constructed from those fields.
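
To make that concrete, such a scsi_io structure could look roughly like the sketch
below (the field and type names are made up for illustration only); only the last,
device-facing stage would turn it into a concrete READ(10)/READ(16)/WRITE SAME/etc.
CDB:

#include <linux/types.h>

struct scatterlist;

/* Generic, SAM-level operation; the exact opcode is chosen later. */
enum scsi_io_op {
        SIO_READ,
        SIO_WRITE,
        SIO_WRITE_SAME,
        SIO_UNMAP,
        SIO_EXTENDED_COPY,
};

/* Hypothetical SAM-level I/O descriptor filled in by the submitter. */
struct scsi_io {
        enum scsi_io_op op;     /* what to do */
        u64 lba;                /* starting logical block address */
        u32 nr_blocks;          /* transfer length in blocks */
        unsigned fua:1;         /* force unit access */
        unsigned prot:1;        /* T10-PI protection requested */
        struct scatterlist *sg; /* data buffer */
        void (*done)(struct scsi_io *io, int result);   /* completion */
};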

The problem with the block abstraction is that it is the least common denominator of all block device capabilities, hence advanced capabilities, available only to some class of devices, automatically become "impossible". Hence, it would be more productive to instead use the most capable abstraction, which is SAM. In this abstraction there is no need to reinvent complex interfaces and write complex middleman code for every advanced capability. All advanced capabilities are available there by definition, if supported by the underlying hardware. That's my point.

POSIX is for simple applications, for which read()/write() calls are sufficient. They are outside of our discussion. But advanced applications need more. I know plenty of applications issuing direct SCSI commands, but how many applications can you name that use the block interface (bsg)? I can recall only one relatively widely used Linux-specific library. That's all. This interface is not demanded by applications.

Vlad


Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))

2013-05-24 Thread Vladislav Bolkhovitin
Martin K. Petersen, on 05/22/2013 09:32 AM wrote:
> Paolo> First of all, I'll note that SG_IO and block-device-specific
> Paolo> ioctls both have their place.  My usecase for SG_IO is
> Paolo> virtualization, where I need to pass information from the LUN to
> Paolo> the virtual machine with as much fidelity as possible if I choose
> Paolo> to virtualize at the SCSI level.  
> 
> Now there's your problem! Several people told you way back that the SCSI
> virt approach was a really poor choice. The SG_IO permissions problem is
> a classic "Doctor, it hurts when I do this".
> 
> The kernel's fundamental task is to provide abstraction between
> applications and intricacies of hardware. The right way to solve the
> problem would have been to provide a better device abstraction built on
> top of the block/SCSI infrastructure we already have in place. If you
> need more fidelity, add fidelity to the block layer instead of punching
> a giant hole through it.
> 
> I seem to recall that reservations were part of your motivation for
> going the SCSI route in the first place. A better approach would have
> been to create a generic reservations mechanism that could be exposed to
> the guest. And then let the baremetal kernel worry about the appropriate
> way to communicate with the physical hardware. Just like we've done with
> reads and writes, discard, write same, etc.

Well, any abstraction is good only if it isn't artificial, i.e. if it solves more
problems than it creates.

Reality is that, de facto, _SCSI_ is the industry's abstraction for block/direct
access to data. Look around: how many of the systems around you, after all the
layers, end up issuing SCSI commands to their storage devices?

The Linux block layer is a purely artificial creature, slowly reinventing the wheel
and creating more problems than it solves. It enforces an approach where "impossible"
often means "impossible in this interface". For instance, how about copy offload? How
about reservations? How about atomic writes? Look at the history of barriers and
compare it with what can be done in SCSI. It's still worse, because it doesn't allow
using all of the devices' capabilities. Why was it necessary to create a special blk
integrity interface with the only end user being SCSI? An artificial task was
created, then well solved. Etc, etc.

The block layer keeps repeating SCSI. So maybe, after all, it's better to acknowledge
that direct usage of SCSI, without any intermediate layers and translations, is more
productive? And for the minority of devices not using SCSI internally, translate from
SCSI to their internal commands? Creating and filling CDB fields is in most cases no
harder than creating and filling bio fields.

So I appreciate the work Paolo is doing in this direction. At least the right thing
will be done at the virtualization level.

I do understand that, with all the existing baggage, replacing the block layer with
SCSI isn't practical, and I am not proposing it; but let's at least acknowledge the
limitations of the academic block abstraction. Let's not turn those limitations into
global walls. Many things are better done using direct SCSI, so let's do them the
better way.

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


WARNING: at lib/dma-debug.c:937 check_unmap+0x45f/0x8b0() (ioat_dma_self_test [ioatdma])

2013-05-13 Thread Vladislav Bolkhovitin
Hello,

I keep getting on each reboot of my kernel 3.9.1 debug system:

[   42.037225] [ cut here ]
[   42.037237] WARNING: at lib/dma-debug.c:937 check_unmap+0x45f/0x8b0()
[   42.037240] Hardware name: PowerEdge R710
[   42.037243] ioatdma :00:16.0: DMA-API: device driver failed to check map 
error[device address=0x0001268fc0f8] [size=2000 bytes] [mapped as single]
[   42.037245] Modules linked in: lpc_ich(+) ehci_pci(+) mfd_core ehci_hcd 
ioatdma(+) dca i7core_edac acpi_power_meter hwmon evdev processor bnx2 
firmware_class loop sg ata_generic pata_acpi usbhid hid raid10 raid456 
async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 
multipath linear crc32c_intel ata_piix libata uhci_hcd mptsas mptscsih mptbase
[   42.037287] Pid: 4457, comm: modprobe Not tainted 3.9.1-scst-dbg #4
[   42.037289] Call Trace:
[   42.037297]  [8103e88f] warn_slowpath_common+0x7f/0xc0
[   42.037302]  [8103e986] warn_slowpath_fmt+0x46/0x50
[   42.037306]  [812bea9f] check_unmap+0x45f/0x8b0
[   42.037312]  [81095ef6] ? __lock_release+0x106/0x170
[   42.037316]  [812bf15a] debug_dma_unmap_page+0x5a/0x60
[   42.037326]  [a01b12c7] ioat_dma_self_test+0x387/0x600 [ioatdma]
[   42.037332]  [81356a8f] ? devres_add+0x4f/0x70
[   42.037338]  [814cc5d2] ? wait_for_completion_timeout+0xf2/0x120
[   42.037346]  [a01b70b6] ioat3_dma_self_test+0x16/0x30 [ioatdma]
[   42.037352]  [a01afbc9] ioat_probe+0xe9/0x100 [ioatdma]
[   42.037359]  [a01b3eb6] ioat3_dma_probe+0x176/0x2d0 [ioatdma]
[   42.037365]  [a01af231] ioat_pci_probe+0x1b1/0x1d0 [ioatdma]
[   42.037370]  [812ccf03] local_pci_probe+0x23/0x40
[   42.037374]  [812cd329] __pci_device_probe+0xd9/0xe0
[   42.037377]  [812cd512] ? pci_dev_get+0x22/0x30
[   42.037381]  [812cd55a] pci_device_probe+0x3a/0x60
[   42.037385]  [813536ac] really_probe+0x6c/0x320
[   42.037389]  [8135399b] driver_probe_device+0x3b/0x80
[   42.037392]  [81353a7b] __driver_attach+0x9b/0xa0
[   42.037396]  [813539e0] ? driver_probe_device+0x80/0x80
[   42.037400]  [813539e0] ? driver_probe_device+0x80/0x80
[   42.037403]  [81351808] bus_for_each_dev+0x98/0xc0
[   42.037407]  [8135338e] driver_attach+0x1e/0x20
[   42.037411]  [81352d88] bus_add_driver+0x208/0x290
[   42.037415]  [a01c] ? 0xa01b
[   42.037418]  [81353de8] driver_register+0x78/0x160
[   42.037422]  [a01c] ? 0xa01b
[   42.037426]  [812cd664] __pci_register_driver+0x64/0x70
[   42.037432]  [a01c006d] ioat_init_module+0x6d/0x1000 [ioatdma]
[   42.037438]  [81000212] do_one_initcall+0x42/0x170
[   42.037443]  [810a425a] do_init_module+0xaa/0x220
[   42.037447]  [810a567e] load_module+0x4ae/0x590
[   42.037452]  [812bce00] ? ddebug_dyndbg_boot_param_cb+0x60/0x60
[   42.037457]  [8129bba0] ? copy_user_generic_string+0x30/0x40
[   42.037461]  [810a1570] ? module_sect_show+0x30/0x30
[   42.037464]  [810a58c4] sys_init_module+0x94/0xc0
[   42.037470]  [814d5f42] system_call_fastpath+0x16/0x1b
[   42.037473] ---[ end trace 4d301874fd4a843a ]---
[   42.037475] Mapped at:
[   42.037477]  [812bf4ad] debug_dma_map_page+0xbd/0x160
[   42.037480]  [a01b10d5] ioat_dma_self_test+0x195/0x600 [ioatdma]
[   42.037487]  [a01b70b6] ioat3_dma_self_test+0x16/0x30 [ioatdma]
[   42.037493]  [a01afbc9] ioat_probe+0xe9/0x100 [ioatdma]
[   42.037499]  [a01b3eb6] ioat3_dma_probe+0x176/0x2d0 [ioatdma]

I first saw this on 3.8. I see a patch to fix it was sent some time ago for
3.9-rcX, but apparently it wasn't applied to the final 3.9.
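
For reference, the warning text ("device driver failed to check map error", mapped at
ioat_dma_self_test) means the self-test maps a DMA buffer but never checks the result
with dma_mapping_error(). The pattern the DMA-API debug code expects looks roughly
like this (an illustrative kernel-style sketch, not the actual ioatdma patch):

  #include <linux/dma-mapping.h>
  #include <linux/errno.h>

  static int example_map_and_use(struct device *dev, void *buf, size_t len)
  {
          dma_addr_t addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

          if (dma_mapping_error(dev, addr))     /* the missing check */
                  return -ENOMEM;

          /* ... start the DMA and wait for it to finish, then: */
          dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);
          return 0;
  }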

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LIO - the broken iSCSI target implementation

2013-01-17 Thread Vladislav Bolkhovitin


Andreas Steinmetz, on 01/16/2013 08:19 PM wrote:

Thus, lio (http://www.linux-iscsi.org/) seemed to be the politically and
technically favoured solution.


[...]


The fun part of it was that I finally ended up using SCST - which was
refrained from kernel inclusion for technical reasons beyond my
knowledge.


No, it was purely political. There has never been any technical argument for why LIO
is better. And there cannot be one.


Thanks,
Vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ANNOUNCE]: SCST version 2.2.1 released

2013-01-15 Thread Vladislav Bolkhovitin
SCST version 2.2.1 has just been released. This release includes SCST core, target 
drivers iSCSI-SCST (iSCSI), qla2x00t (QLogic Fibre Channel), ib_srpt (InfiniBand 
SRP) and scst_local (local loopback-like access) as well as SCST management 
utility scstadmin.


SCST allows creation of sophisticated storage devices, which provide advanced
functionality, like replication, thin provisioning, deduplication, high
availability, automatic backup, etc. Another class of such devices are Virtual
Tape Libraries (VTL) as well as other disk-based backup solutions. Devices created
with SCST are not limited to network interfaces only; they can use any link
supporting SCSI-style data exchange, like Fibre Channel or SAS. The majority of
modern SAN appliances, especially higher-end ones, are SCST based. It might well be
that your favorite storage appliance is running SCST in its firmware.


More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially Bart Van Assche, who prepared the 
packages!


Vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-28 Thread Vladislav Bolkhovitin


Nico Williams, on 11/26/2012 03:05 PM wrote:

Vlad,

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this. Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.


Well, your understanding of memory barriers is wrong, and you are illustrating that
the memory barriers concept is not so well understood in practice.


Simplifying, memory barrier instructions are not a "cache flush" of this CPU, as is
often thought. They set the order in which reads or writes from other CPUs become
visible on this CPU, and nothing else. Locally, on each CPU, reads and writes are
always seen in order. So, (1) on a single-CPU system memory barrier instructions
don't make any sense, and (2) they should go at least in a pair for each CPU
participating in the interaction; otherwise it's an apparent sign of a mistake.
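
To make the pairing point concrete, here is a minimal, self-contained illustration
using C11 atomics (not the kernel's smp_*mb() macros): the producer's release fence
only means something because the consumer has a matching acquire fence; either one
alone guarantees nothing.

  /* Build with: cc -std=c11 -pthread pairing.c */
  #include <stdatomic.h>
  #include <pthread.h>
  #include <stdio.h>

  static int data;
  static atomic_int ready;

  static void *producer(void *arg)
  {
          data = 42;                                     /* plain store      */
          atomic_thread_fence(memory_order_release);     /* "write barrier"  */
          atomic_store_explicit(&ready, 1, memory_order_relaxed);
          return NULL;
  }

  static void *consumer(void *arg)
  {
          while (!atomic_load_explicit(&ready, memory_order_relaxed))
                  ;                                      /* spin on the flag */
          atomic_thread_fence(memory_order_acquire);     /* "read barrier"   */
          printf("%d\n", data);                          /* guaranteed 42    */
          return NULL;
  }

  int main(void)
  {
          pthread_t p, c;
          pthread_create(&p, NULL, producer, NULL);
          pthread_create(&c, NULL, consumer, NULL);
          pthread_join(p, NULL);
          pthread_join(c, NULL);
          return 0;
  }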


There's nothing similar in storage, because storage has strong consistency 
requirements even if it is distributed. All those clouds and hadoops with weak 
consistency requirements are outside of this discussion, although even they don't 
have anything similar to memory barriers.


As I already wrote, the concept of a flat Earth with the Sun revolving around it is
also very simple to understand. Are you still using that concept?



So just give us a barrier.


As with the flat Earth, I'd strongly suggest you start using an adequate concept of
what you want to achieve, starting from what I proposed a few e-mails ago in this
thread.


If you look at it, it offers exactly what you want, only named correctly.

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-19 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and requests ordering abstractions to the
user space:

For async IO (io_submit() and friends) I would extend struct iocb by flags, 
which
would allow to set the required capabilities, i.e. if this request is FUA, or 
full
cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per
each iocb.

For the regular read()/write() I would add to "flags" parameter of
sync_file_range() one more flag: if this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would make
the latest submitted write in this fd ORDERED.


Correction. To avoid possible races, it would be better for the new fcntl() command
to specify that the N subsequent read()/write()/sync() calls are ORDERED.


For instance, in the simplest case of N=1, the single write() issued right after the
fcntl() would be handled as ORDERED.


(Unfortunately, it doesn't look like this old read()/write() interface has room for a
more elegant solution.)


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-16 Thread Vladislav Bolkhovitin


Chris Friesen, on 11/15/2012 05:35 PM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and request ordering abstractions to the user
space:


For async IO (io_submit() and friends) I would extend struct iocb with flags which
allow setting the required capabilities, i.e. whether this request is FUA, or a full
cache sync, immediate [1] or not, ORDERED or not, or any combination of those, per
each iocb.


For the regular read()/write() I would add one more flag to the "flags" parameter of
sync_file_range(): whether this sync is immediate or not.


To enforce ordering rules I would add one more command to fcntl(). It would make the
latest submitted write on this fd ORDERED.


All together those should provide the requested functionality in a simple, effective,
unambiguous and backward-compatible manner.
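
As a sketch of what that could look like from the application side (assuming libaio;
the IOCB_FLAG_* values below are made up for this example and do not exist in the
current kernel):

  #include <libaio.h>
  #include <stddef.h>

  /* Hypothetical per-request attribute flags illustrating the proposal. */
  #define IOCB_FLAG_FUA        (1u << 8)
  #define IOCB_FLAG_SYNC_IMM   (1u << 9)    /* immediate cache sync */
  #define IOCB_FLAG_ORDERED    (1u << 10)

  int submit_ordered_fua_write(io_context_t ctx, int fd,
                               void *buf, size_t len, long long off)
  {
          struct iocb cb;
          struct iocb *cbs[1] = { &cb };

          io_prep_pwrite(&cb, fd, buf, len, off);
          /* Per-request attributes instead of a coarse barrier: this write
           * is ORDERED and must hit non-volatile media (FUA). */
          cb.u.c.flags |= IOCB_FLAG_FUA | IOCB_FLAG_ORDERED;

          return io_submit(ctx, 1, cbs);    /* returns 1 on success */
  }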


Vlad

1. See my other today's e-mail about what is immediate cache sync.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-16 Thread Vladislav Bolkhovitin

David Lang, on 11/15/2012 07:07 AM wrote:

There's no such thing as "barrier". It is fully artificial abstraction. After
all, at the bottom of your stack, you will have to translate it either to cache
flush, or commands order enforcement, or both.


When people talk about barriers, they are talking about order enforcement.


Not correct. When people talk about barriers, they mean different things. For
instance, Alan Cox a few e-mails ago meant cache flush.


That's the problem with the barriers concept: barriers are ambiguous. There's no
single barrier which can fit all requirements.



the hardware capabilities are not directly accessable from userspace (and they
probably shouldn't be)


The discussion is not about directly exposing storage hardware capabilities to user
space. The discussion is about replacing the fully inadequate barrier abstraction
with a set of other, adequate abstractions.


For instance:

1. Cache flush primitives:

1.1. FUA

1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile 
media


1.3. Immediate cache flush, i.e. return ASAP after the cache sync started, 
possibly before all data hit non-volatile media.


2. ORDERED attribute for requests. It provides the following behavior rules:

A. All requests without this attribute can be executed in parallel and be freely
reordered.


B. No ORDERED command can be completed before any previous not-ORDERED or ORDERED
command has completed.


Those abstractions can naturally fit all storage capabilities. For instance:

 - On simple WT cache hardware not supporting ordering commands, (1) translates to a
NOP and (2) to queue draining.


 - On full-featured HW, both (1) and (2) translate to the appropriate storage
capabilities.


On FTL storage, (B) can be further optimized by doing the data transfers for ORDERED
commands in parallel but committing them in the requested order.
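
Purely as an illustration of that translation (all names made up for this sketch),
the submission path could map the per-request attributes to whatever the device
actually supports:

  #include <stdbool.h>

  enum cache_sync { SYNC_NONE, SYNC_NON_IMMEDIATE, SYNC_IMMEDIATE, SYNC_FUA };

  struct dev_caps {
          bool write_back_cache;     /* false => write-through cache        */
          bool ordered_commands;     /* device honors the ORDERED attribute */
  };

  struct req_attrs {
          enum cache_sync sync;
          bool ordered;
  };

  enum submit_action { SUBMIT_PLAIN, SUBMIT_WITH_FUA, SUBMIT_THEN_FLUSH,
                       SUBMIT_ORDERED_TAG, SUBMIT_AFTER_QUEUE_DRAIN };

  enum submit_action translate(const struct req_attrs *r, const struct dev_caps *c)
  {
          if (r->ordered)
                  /* Use the device's ordering capability when present, else
                   * fall back to draining the queue before this request. */
                  return c->ordered_commands ? SUBMIT_ORDERED_TAG
                                             : SUBMIT_AFTER_QUEUE_DRAIN;

          if (r->sync == SYNC_NONE || !c->write_back_cache)
                  return SUBMIT_PLAIN;   /* on WT cache, a cache sync is a NOP */

          return r->sync == SYNC_FUA ? SUBMIT_WITH_FUA : SUBMIT_THEN_FLUSH;
  }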



barriers keep getting mentioned because they are a easy concept to understand.


Well, the concept of a flat Earth with the Sun rotating around it is also easy to
understand. So why isn't it used?


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-16 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:

1. fsync actually does two things at the same time: ordering writes (in a
barrier-like manner), and forcing cached writes to disk. This makes it very
difficult to implement fsync efficiently.


Exactly!


However, logically they are two distinctive functionalities


Exactly!

Those two points are exactly why the concept of barriers must be forgotten for the
sake of productivity and replaced by finer-grained abstractions, and also why they
were removed from the Linux kernel.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-14 Thread Vladislav Bolkhovitin


Nico Williams, on 11/13/2012 02:13 PM wrote:

declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.


Which barriers? Barriers meaning cache flush, or barriers meaning command order, or
barriers meaning both?


There's no such thing as a "barrier". It is a fully artificial abstraction. After
all, at the bottom of your stack, you will have to translate it either to a cache
flush, or to command order enforcement, or to both.


Are you going to invent 3 types of barriers?


There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.


Your mistake is that you are treating barriers as something real, which can do
something real for you, while it is just an artificial abstraction, apparently
invented by people with limited knowledge of how storage works, and hence with a
very foggy vision of how barriers are supposed to be processed by it. A simple,
wrong answer.


Generally, you can invent any abstraction convenient for you, but the farther your
abstractions are from the reality of your hardware, the less you will get from it,
and with bigger effort.


There are no barriers in Linux, and there are not going to be. Accept it. And start
thinking instead about the offload capabilities your storage can offer you.


Vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-14 Thread Vladislav Bolkhovitin


Alan Cox, on 11/13/2012 12:40 PM wrote:

Barriers are pretty much universal as you need them for power off !


I'm afraid, no storage (drives, if you like this term more) at the moment 
supports
barriers and, as far as I know the storage history, has never supported.


The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.


The cache flush is a cache flush. You can call it a barrier if you want to continue
confusing yourself and others.



Instead, what storage does support in this area are:


Yes - the devil is in the detail once you go beyond simple capabilities.


None of those details brings anything unsolvable. For instance, I already described
earlier in this thread a simple way in which the requested order of commands can be
carried through the stack, and I implemented that algorithm in SCST.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:

 SATA's Native Command

Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.



And so? If SATA can't do it, does it mean that nobody else can't do it
too? I know a plenty of non-SATA devices, which can do the ordering
requirements you need.



I would be very much interested in what kind of device support this kind of
"topological order", and in what settings they are typically used.

Does modern flash/SSD (esp. which are used on smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.


I don't think the storage in smartphones can support such advanced functionality,
because it tends to be the cheapest, hence the simplest.


But many modern enterprise SAS drives can do it, because for those customers
performance is the key requirement. Unfortunately, I'm not sure I can name exact
brands and models, because my knowledge comes from NDA'ed docs, so this info may
also be NDA'ed.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

Richard Hipp, on 11/02/2012 08:24 AM wrote:

SQLite cares.  SQLite is an in-process, transaction, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to be have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.


I would suggest you forget the word "barrier" for productivity's sake. You don't
want barriers and the confusion they bring. What you want instead is access to
storage-accelerated cache sync, command ordering and atomic attributes/operations.
See my other e-mail from today about those.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin


Alan Cox, on 11/02/2012 08:33 AM wrote:

b) most drives will internally re-order requests anyway


They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.


c) cheap drives won't support barriers


Barriers are pretty much universal as you need them for power off !


I'm afraid no storage (drives, if you like this term more) at the moment supports
barriers and, as far as I know the storage history, none ever has.


Instead, what storage does support in this area is:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Command ordering facilities: command attributes (ORDERED, SIMPLE, etc.), ACA,
etc.


3. Atomic commands, e.g. scattered writes, which allow writing data to several
separate, non-adjacent blocks in an atomic manner, i.e. guarantee that either all
blocks are written or none at all. This is relatively new functionality, natural
for flash storage with its COW internals.


Obviously, using such atomic write commands, an application or a file system doesn't
need any journaling anymore. FusionIO reported that after they modified MySQL to use
them, they saw a 50% performance increase.


Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently,
including on the same request. That is the root cause of why the barrier concept is
so evil. If you specify a barrier, how can you say what actual action you really
want from the storage: a cache flush? An ordered write? Or both?


This is why the relatively recent removal of barriers from the Linux kernel
(http://lwn.net/Articles/400541/) was a big step ahead. The next logical step should
be to allow the ORDERED attribute for requests to be accelerated by the ORDERED
commands of the storage, if it supports them; if not, fall back to the existing
queue draining.


Actually, I'm wondering why the barrier concept is so sticky in the Linux world. A
simple Google search shows that only Linux uses this concept for storage. And 2
years have passed since they were removed from the kernel, but people still discuss
barriers as if they were still here.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin


Howard Chu, on 11/01/2012 08:38 PM wrote:

Alan Cox wrote:

How about that recently preliminary infrastructure to send ORDERED commands
instead of queue draining was deleted from the kernel, because "there's no
difference where to drain the queue, on the kernel or the storage side"?


Send patches.


Isn't any type of kernel-side ordering an exercise in futility, since
a) the kernel has no knowledge of the disk's actual geometry
b) most drives will internally re-order requests anyway
c) cheap drives won't support barriers


This is why it is so important for performance to use all the storage capabilities.
Particularly, ORDERED commands instead of queue draining, which tries to pretend to
be smarter than the storage.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

Richard Hipp, on 11/02/2012 08:24 AM wrote:

SQLite cares.  SQLite is an in-process, transaction, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to be have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.


I would suggest you forget the word "barrier" for productivity's sake. You don't 
want barriers and the confusion they bring. What you want instead is access to 
storage-accelerated cache sync, command ordering and atomic attributes/operations. 
See my other e-mail from today about those.
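
For readers following the thread, the fsync()-as-barrier pattern Richard describes 
boils down to something like this minimal sketch (plain POSIX calls, not SQLite's 
actual code; the file names are invented):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int jfd = open("journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int dfd = open("database", O_WRONLY | O_CREAT, 0644);
    const char rec[]  = "journal record for page 42";
    const char page[] = "new contents of page 42";

    if (jfd < 0 || dfd < 0) { perror("open"); return 1; }

    write(jfd, rec, sizeof(rec));
    fsync(jfd);              /* the "barrier": the record must be durable ... */

    pwrite(dfd, page, sizeof(page), (off_t)42 * 4096);
    fsync(dfd);              /* ... before the protected page is overwritten */

    close(jfd);
    close(dfd);
    return 0;
}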


Vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-12 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:

 SATA's Native Command

Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.



And so? If SATA can't do it, does it mean that nobody else can't do it
too? I know a plenty of non-SATA devices, which can do the ordering
requirements you need.



I would be very much interested in what kind of device support this kind of
topological order, and in what settings they are typically used.

Does modern flash/SSD (esp. which are used on smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.


I don't think the storage in smartphones can support such advanced functionality, 
because it tends to be the cheapest and hence the simplest.


But many modern enterprise SAS drives can do it, because for those customers 
performance is the key requirement. Unfortunately, I'm not sure I can name exact 
brands and models, because my knowledge comes from NDA'ed docs, so this information 
may also be NDA'ed.


Vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-01 Thread Vladislav Bolkhovitin


Alan Cox, on 11/01/2012 05:24 PM wrote:

How about that recently preliminary infrastructure to send ORDERED commands
instead of queue draining was deleted from the kernel, because "there's no
difference where to drain the queue, on the kernel or the storage side"?


Send patches.


OK, then we have good progress!

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-11-01 Thread Vladislav Bolkhovitin


Alan Cox, on 10/31/2012 05:54 AM wrote:

I don't want to flame on this topic, but you are not right here. As far as I can
see, a big chunk of Linux storage and file system developers are/were employed 
by
the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.

You know, RedHat from recent times also stepped to this market, at least I saw
their advertisement on SDC 2012. So, you can add here all RedHat employees.


Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, some high end desktops are using
SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point


That doesn't contradict the point that high-performance storage vendors are also 
funding Linux kernel storage development.



Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.


How about that recently preliminary infrastructure to send ORDERED commands 
instead of queue draining was deleted from the kernel, because "there's no 
difference where to drain the queue, on the kernel or the storage side"?


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-30 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/27/2012 12:44 AM wrote:

On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:

What different in our positions is that you are considering storage
as something you can connect to your desktop, while in my view
storage is something, which stores data and serves them the best
possible way with the best performance.


I don't get paid to make Linux storage work well for gold-plated
storage, and as far as I know, none of the purveyors of said gold
plated software systems are currently employing Linux file system
developers to make Linux file systems work well on said gold-plated
hardware.


I don't want to start a flame war on this topic, but you are not right here. As far 
as I can see, a big chunk of Linux storage and file system developers are/were 
employed by the "gold-plated storage" manufacturers, starting with FusionIO, SGI 
and Oracle.


You know, RedHat has recently also stepped into this market; at least I saw their 
advertisement at SDC 2012. So you can add all RedHat employees here as well.



As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware.  Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features.  Perhaps now you'll understand better it's not happening?


Price doesn't matter here, because it's completely different topic.


It matters if you think I'm going to do it on my own time, out of my
own budget.  And if you think my employer is going to choose to use
said hardware, price definitely matters.  I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.

It's rare that you get to design something where performance matters
above all else.  Maybe it's that way if you're paid by folks whose job
it is to destablize the world's financial markets by pushing the holes
into the right half plane (i.e., high frequency trading :-).  But for
the rest of the world, price absolutely matters.


I fully understand your position. But "affordable" and "useful" are completely 
orthogonal things. The "high end" features are very useful if you want to get high 
performance. Those who can afford them will use them, which might be your favorite 
bank, for instance, so those features will indirectly be working for you.


Of course, you don't have to work on those features, especially for free, but you 
similarly shouldn't call them useless only because they are not affordable enough 
to be put in a desktop [1].


Our discussion started not from "value-for-money", but from a constant demand to 
perform ordered commands without full queue draining, which has been ignored by the 
Linux storage developers for YEARS as not useful, right?


Vlad

[1] If you or somebody else wants to put something in a desktop that supports all 
the features necessary to perform ORDERED commands, including ACA, you can look at 
modern SAS SSDs. I can't call the price of those devices "high-end".



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-26 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 09:50 AM wrote:

Yeah  I don't buy that.  One, flash is still too expensive.  Two,
the capital costs to build enough Silicon foundries to replace the
current production volume of HDD's is way too expensive for any
company to afford (the cloud providers are buying *huge* numbers of
HDD's) --- and that's assuming companies wouldn't chose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than short-term cache, and not as a long term stable store.

If end users completely give up on flash, and store all of their
precious family pictures on flash storage, after a couple of years,
they are likely going to be very disappointed

Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.


Here I agree with you.

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-26 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 01:14 AM wrote:

On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:

Yes, SCSI has full support for ordered/simple commands designed
exactly for that task: to have steady flow of commands even in case
when some of them are ordered.


SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ).  Not all devices do.

More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ.


What is different in our positions is that you are considering storage as something 
you can connect to your desktop, while in my view storage is something which stores 
data and serves it in the best possible way with the best performance.


Hence, for you the least common denominator of all storage features is what matters 
most, while for me getting the best of what is possible from the storage matters 
most.


In my view storage should offload as much as possible from the host system: data 
movement, ordered-operation requirements, atomic operations, deduplication, 
snapshots, reliability measures (e.g. RAID), load balancing, etc.


It's the same as with 2D/3D video acceleration hardware. If you want the best 
performance from your system, you should offload as much as possible from it: in 
the case of video, to the video hardware; in the case of storage, to the storage. 
As with video, better offload means better performance for storage. At hundreds of 
thousands of IOPS it is clearly visible.


Price doesn't matter here, because it's a completely different topic.


SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.


And so? If SATA can't do it, does that mean nobody else can do it either? I know 
plenty of non-SATA devices which can satisfy the ordering requirements you need.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-26 Thread Vladislav Bolkhovitin


Nico Williams, on 10/24/2012 05:17 PM wrote:

Yes, SCSI has full support for ordered/simple commands designed exactly for
that task: [...]

[...]

But historically for some reason Linux storage developers were stuck with
"barriers" concept, which is obviously not the same as ORDERED commands,
hence had a lot troubles with their ambiguous semantic. As far as I can tell
the reason of that was some lack of sufficiently deep SCSI understanding
(how to handle errors, believe that ACA is something legacy from parallel
SCSI times, etc.).


Barriers are a very simple abstraction, so there's that.


It isn't simple at all. If you think for some time about barriers from the storage 
point of view, you will soon realize how bad and ambiguous they are.



Before that happens, people will keep returning again and again with those
simple questions: why the queue must be flushed for any ordered operation?
Isn't is an obvious overkill?


That [cache flushing]


It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if 
you like.


Often there's a big difference depending on where it's done: on the system side or 
on the storage side.


Actually, in many cases the performance improvements from NCQ come not from 
allowing the drive to reorder requests, as is commonly thought, but from allowing 
the drive's internal processing stages to stay busy without any idle time. Drives 
often have a long internal pipeline, hence the need to keep every stage of it 
always busy, and hence why using ORDERED commands is important for performance.



is not what's being asked for here. Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known ubberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, d) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.


I believe what you really want is to be able to send to the storage a sequence of 
your favorite operations (FS operations, async IO operations, etc.) like:


Write back caching disabled:

data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write back caching enabled:

data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, 
..., data op2M, ...


Right?

(ORDERED means it is guaranteed that this ordered command will never, under any 
circumstances, be executed before all previous commands have completed, and that no 
subsequent command will be executed before it completes.)
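
Spelled out as a tagged command stream, the write-back-caching case above would 
look roughly like the toy sketch below (the names are mine for illustration only, 
not a real SCSI or block layer API):

#include <stdio.h>

enum task_attr { ATTR_SIMPLE, ATTR_ORDERED };

struct cmd { const char *what; enum task_attr attr; int fua; };

static void queue_cmd(const struct cmd *c)
{
    printf("%-20s attr=%-7s FUA=%d\n", c->what,
           c->attr == ATTR_ORDERED ? "ORDERED" : "SIMPLE", c->fua);
}

int main(void)
{
    const struct cmd stream[] = {
        { "data op 1.1",       ATTR_SIMPLE,  0 },
        { "data op 1.N",       ATTR_SIMPLE,  0 },
        { "SYNCHRONIZE CACHE", ATTR_ORDERED, 0 },  /* ordered cache sync */
        { "ordered data op 1", ATTR_ORDERED, 1 },  /* ordered FUA write  */
        { "data op 2.1",       ATTR_SIMPLE,  0 },
        { "data op 2.M",       ATTR_SIMPLE,  0 },
    };

    for (unsigned i = 0; i < sizeof(stream) / sizeof(stream[0]); i++)
        queue_cmd(&stream[i]);
    return 0;
}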


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sqlite] light weight write barriers

2012-10-23 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:

I am not quite whether I should ask this question here, but in terms
of light weight barrier/fsync, could anyone tell me why the device
driver / OS provide the barrier interface other than some other
abstractions anyway? I am sorry if this sounds like a stupid questions
or it has been discussed before

I mean, most of the time, we only need some ordering in writes; not
complete order, but partial,very simple topological order. And a
barrier seems to be a heavy weighted solution to achieve this anyway:
you have to finish all writes before the barrier, then start all
writes issued after the barrier. That is some ordering which is much
stronger than what we need, isn't it?

As most of the time the order we need do not involve too many blocks
(certainly a lot less than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I image it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally. Particularly, it seems not
too hard to be implemented on top of SCSI's ordered/simple task mode?


Yes, SCSI has full support for ordered/simple commands, designed exactly for that 
task: to have a steady flow of commands even when some of them are ordered. It also 
has the necessary facilities to handle command errors without unexpected reordering 
of subsequent commands (ACA, etc.). Those allow getting full storage performance by 
fully "filling the pipe", in networking terms. I can easily imagine real-life 
configs where it can bring 2+ times more performance than queue flushing.


In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally 
have the link as the point of serialization, so all you need in a multithreaded 
environment is to carry a sequence number (SN) from the point where each ORDERED 
command is created to the point where it is sent to the link, and to make sure that 
no SIMPLE commands can ever cross ORDERED commands. You can see how it is 
implemented in SCST in an elegant and lockless manner (for SIMPLE commands).
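
A toy model of that serialization idea might look like the sketch below (my 
illustration only, not SCST's actual code; for simplicity it handles a single 
outstanding ORDERED command):

#include <stdio.h>

struct link_state {
    unsigned long next_sn;      /* next SN to hand out (starts at 1)        */
    unsigned long sent;         /* how many commands already hit the link   */
    unsigned long ordered_sn;   /* SN of the pending ORDERED cmd, 0 = none  */
};

static unsigned long new_cmd(struct link_state *ls, int ordered)
{
    unsigned long sn = ls->next_sn++;
    if (ordered)
        ls->ordered_sn = sn;
    return sn;
}

static int may_send(const struct link_state *ls, unsigned long sn, int ordered)
{
    if (ordered)
        return ls->sent == sn - 1;  /* every older command is already out */
    if (ls->ordered_sn && sn > ls->ordered_sn && ls->sent < ls->ordered_sn)
        return 0;                   /* SIMPLE must not pass the pending ORDERED */
    return 1;                       /* SIMPLE commands may pass each other */
}

int main(void)
{
    struct link_state ls = { 1, 0, 0 };
    unsigned long a = new_cmd(&ls, 0);   /* SIMPLE  */
    unsigned long b = new_cmd(&ls, 1);   /* ORDERED */
    unsigned long c = new_cmd(&ls, 0);   /* SIMPLE  */

    printf("send ORDERED %lu now? %d\n", b, may_send(&ls, b, 1)); /* 0: a not out  */
    printf("send SIMPLE  %lu now? %d\n", c, may_send(&ls, c, 0)); /* 0: behind b   */
    printf("send SIMPLE  %lu now? %d\n", a, may_send(&ls, a, 0)); /* 1: may go now */
    return 0;
}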


But historically, for some reason, Linux storage developers were stuck with the 
"barriers" concept, which is obviously not the same as ORDERED commands, and hence 
had a lot of trouble with its ambiguous semantics. As far as I can tell, the reason 
was a lack of sufficiently deep SCSI understanding (how to handle errors, the 
belief that ACA is something legacy from parallel SCSI times, etc.).


Hopefully, the storage developers will eventually realize the value behind ordered 
commands and learn the corresponding SCSI facilities for dealing with them. It's 
quite easy to demonstrate this value if you know where to look and don't blindly 
refuse the possibility. I have already tried to explain it a couple of times, but 
was not successful.


Before that happens, people will keep coming back again and again with the same 
simple questions: why must the queue be flushed for any ordered operation? Isn't 
that obvious overkill?


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/6] target/file: Re-enable optional fd_buffered_io=1 operation

2012-10-02 Thread Vladislav Bolkhovitin

Christoph Hellwig, on 10/01/2012 04:46 AM wrote:

On Sun, Sep 30, 2012 at 05:58:11AM +, Nicholas A. Bellinger wrote:

From: Nicholas Bellinger

This patch re-adds the ability to optionally run in buffered FILEIO mode
(eg: w/o O_DSYNC) for device backends in order to once again use the
Linux buffered cache as a write-back storage mechanism.

This difference with this patch is that fd_create_virtdevice() now
forces the explicit setting of emulate_write_cache=1 when buffered FILEIO
operation has been enabled.


What this lacks is a clear reason why you would enable this inherently
unsafe mode.  While there is some clear precedence to allow people doing
stupid thing I'd least like a rationale for it, and it being documented
as unsafe.


Nowadays nearly all serious applications are transactional and know how to flush 
the storage cache between transactions. That means that write-back caching is 
absolutely safe for them. No data can be lost under any circumstances.
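
As a minimal sketch of what "transactional and knows how to flush" means in 
practice (plain POSIX calls; the file name and helper are invented for 
illustration):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* The application treats a transaction as committed only after an explicit
 * cache flush (fdatasync() here) has returned, so data sitting in a
 * write-back cache before that point can be lost without harm. */
static int commit_transaction(int fd, const char *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fdatasync(fd);       /* the commit point */
}

int main(void)
{
    const char payload[] = "transaction #1 payload";
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);

    if (fd < 0) { perror("open"); return 1; }
    if (commit_transaction(fd, payload, sizeof(payload), 0) < 0)
        perror("commit_transaction");
    close(fd);
    return 0;
}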


Welcome to the 21st century, Christoph!

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: very poor ext3 write performance on big filesystems?

2008-02-19 Thread Vladislav Bolkhovitin

Tomasz Chmielewski wrote:

I have a 1.2 TB (of which 750 GB is used) filesystem which holds
almost 200 millions of files.
1.2 TB doesn't make this filesystem that big, but 200 millions of files 
is a decent number.



Most of the files are hardlinked multiple times, some of them are
hardlinked thousands of times.


Recently I began removing some of unneeded files (or hardlinks) and to 
my surprise, it takes longer than I initially expected.



After cache is emptied (echo 3 > /proc/sys/vm/drop_caches) I can usually 
remove about 5-20 files with moderate performance. I see up to 
5000 kB read/write from/to the disk, wa reported by top is usually 20-70%.



After that, waiting for IO grows to 99%, and disk write speed is down to 
50 kB/s - 200 kB/s (fifty - two hundred kilobytes/s).



Is it normal to expect the write speed go down to only few dozens of 
kilobytes/s? Is it because of that many seeks? Can it be somehow 
optimized? The machine has loads of free memory, perhaps it could be 
uses better?



Also, writing big files is very slow - it takes more than 4 minutes to 
write and sync a 655 MB file (so, a little bit more than 1 MB/s) - 
fragmentation perhaps?


It would be really interesting if you tried your workload with XFS. In my 
experience, XFS considerably outperforms ext3 on big (more than a few hundred MB) 
disks.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-11 Thread Vladislav Bolkhovitin

Luben Tuikov wrote:

Is there an open iSCSI Target implementation which does NOT issue commands to 
sub-target devices via the SCSI mid-layer, but bypasses it completely?


What do you mean? To call directly low level backstorage SCSI drivers 
queuecommand() routine? What are advantages of it?


Yes, that's what I meant.  Just curious.


What's the advantage of it?


Thanks,
   Luben

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:


Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

  Luben




Hi Luben,

I am guessing you mean futher down the stack, which I don't know this to
be the case.  Going futher up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout.  The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :).  I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects.  Could someone be so kind to fill me in on this..?


SCST uses the scsi_execute_async_fifo() function to submit commands to SCSI 
devices in pass-through mode. This function is a slightly modified version of 
scsi_execute_async() which submits requests in FIFO order instead of LIFO, as 
scsi_execute_async() does (so with scsi_execute_async() they are executed in the 
reverse order). scsi_execute_async_fifo() is added to the kernel as a separate 
patch.
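
As a generic illustration of the FIFO vs. LIFO difference just described (a toy 
list only, not the kernel code or the actual patch): inserting each new request at 
the head of a queue makes the consumer see them in reverse (LIFO) order, while 
inserting at the tail preserves submission (FIFO) order.

#include <stdio.h>

struct req { int id; struct req *next; };

static void add_head(struct req **q, struct req *r)
{
    r->next = *q;               /* LIFO: newest request first */
    *q = r;
}

static void add_tail(struct req **q, struct req *r)
{
    r->next = NULL;             /* FIFO: keep submission order */
    while (*q)
        q = &(*q)->next;
    *q = r;
}

static void run_queue(const char *name, struct req *q)
{
    printf("%s:", name);
    for (; q; q = q->next)
        printf(" req%d", q->id);
    printf("\n");
}

int main(void)
{
    struct req a = {1}, b = {2}, c = {3};
    struct req x = {1}, y = {2}, z = {3};
    struct req *lifo = NULL, *fifo = NULL;

    add_head(&lifo, &a); add_head(&lifo, &b); add_head(&lifo, &c);
    add_tail(&fifo, &x); add_tail(&fifo, &y); add_tail(&fifo, &z);

    run_queue("LIFO (head insert)", lifo);   /* req3 req2 req1 */
    run_queue("FIFO (tail insert)", fifo);   /* req1 req2 req3 */
    return 0;
}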



Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented.  Userspace passthrough in
this context is an target engine I/O that is enforcing max_sector and
sector_size limitiations, and encodes/decodes target storage protocol
packets all out of view of userspace.  The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.

--nab

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?


I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST 
framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work over 
some standard SCSI target framework. Hence the choice gets narrower: SCST vs 
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) 
in the mainline, because of a lot of code duplication. Nicholas could decide 
to move to either existing framework (although, frankly, I don't think 
there's a possibility for in-kernel iSCSI target and user space SCSI target 
framework) and if he decide to go with SCST, I'll be glad to offer my help 
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The 
better one should win.


why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.




I don't think anyone is saying it should be.  It makes sense that the
more mature SCSI engines that have working code will be providing alot
of the foundation as we talk about options..


From comparing the designs of SCST and LIO-SE, we know that SCST has

supports very SCSI specific target mode hardware, including software
target mode forks of other kernel code.  This code for the target mode
pSCSI, FC and SAS control paths (more for the state machines, that CDB
emulation) that will most likely never need to be emulated on non SCSI
target engine.


...but it is required for SCSI, so it has to be there anyway.


SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER) while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
and data to talk to every potential device in the Linux storage stack on
the largest amount of hardware architectures possible.

Most of the iSCSI Initiators I know (including non Linux) do not rely on
heavy SCSI task management, and I think this would be a lower priority
item to get real SCSI specific recovery in the traditional iSCSI target
for users.  Espically things like SCSI target mode queue locking
(affectionally called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because CmdSN rules are doing this for us.


Sorry, that isn't correct. ACA provides the ability to lock the command queue 
in case of a CHECK CONDITION, so it allows keeping the command execution order 
in case of errors. CmdSN keeps the command execution order only in the success 
case; in case of an error, the next queued command will be executed immediately 
after the failed one, although the application might require all commands 
subsequent to the failed one to be aborted. Think about journaled file systems, 
for instance. ACA also allows retrying the failed command and then resuming the 
queue.
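
A toy model of the difference (my illustration, not a real SCSI implementation): 
with plain CmdSN ordering a failure lets the next queued command run anyway, while 
ACA freezes the queue at the failing command so that nothing younger executes 
until the initiator decides what to do.

#include <stdio.h>

struct cmd { const char *name; int will_fail; };

static void run_queue(const struct cmd *q, int n, int use_aca)
{
    for (int i = 0; i < n; i++) {
        if (q[i].will_fail) {
            printf("%s -> CHECK CONDITION\n", q[i].name);
            if (use_aca) {
                printf("ACA established: %d younger command(s) held\n", n - i - 1);
                return;         /* initiator may abort the rest or retry */
            }
            continue;           /* CmdSN only: next command runs immediately */
        }
        printf("%s -> GOOD\n", q[i].name);
    }
}

int main(void)
{
    const struct cmd q[] = {
        { "write journal record", 0 },
        { "write metadata block", 1 },   /* fails */
        { "write commit block",   0 },   /* must not run if the previous failed */
    };

    printf("-- CmdSN ordering only --\n");
    run_queue(q, 3, 0);
    printf("-- with ACA --\n");
    run_queue(q, 3, 1);
    return 0;
}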


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

[EMAIL PROTECTED] wrote:

On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote:


Bart Van Assche wrote:


- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?



I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated 
for SCST framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work 
over some standard SCSI target framework. Hence the choice gets 
narrower: SCST vs STGT. I don't think there's a way for a dedicated 
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code 
duplication. Nicholas could decide to move to either existing 
framework (although, frankly, I don't think there's a possibility for 
in-kernel iSCSI target and user space SCSI target framework) and if he 
decide to go with SCST, I'll be glad to offer my help and support and 
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better 
one should win.



why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


the most common use of this sort of thing that I would see is to load up 
a bunch of 1TB SATA drives in a commodity PC, run software RAID, and 
then export the resulting volume to other servers via iSCSI. not a 
'real' SCSI device in sight.


As far as how good a standard iSCSI is, at this point I don't think it 
really matters. There are too many devices and manufacturers out there 
that implement iSCSI as their storage protocol (from both sides, 
offering storage to other systems, and using external storage). 
Sometimes the best technology doesn't win, but Linux should be 
interoperable with as much as possible and be ready to support the 
winners and the loosers in technology options, for as long as anyone 
chooses to use the old equipment (after all, we support things like 
Arcnet networking, which lost to Ethernet many years ago)


David, your question surprises me a lot. Where did you get the idea that SCST 
supports only pass-through backstorage? Does the RAM disk, which Bart has been 
using for performance tests, look like a SCSI device?


SCST supports all backstorage types you can imagine and that the Linux kernel 
supports.



David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Luben Tuikov wrote:

Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?


What do you mean? Calling the low-level backstorage SCSI drivers'
queuecommand() routine directly? What would be the advantages of that?



   Luben

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this?


I tend to agree, with some important notes:

1. IET should be excluded from this list; iSCSI-SCST is IET updated for
the SCST framework, with a lot of bug fixes and improvements.


2. I think everybody will agree that a Linux iSCSI target should work on
top of some standard SCSI target framework. Hence the choice gets
narrower: SCST vs. STGT. I don't think there's a place for a dedicated
iSCSI target (i.e. PyX/LIO) in the mainline, because of the large amount
of code duplication. Nicholas could decide to move to either existing
framework (although, frankly, I don't see how an in-kernel iSCSI target
could sit on top of a user-space SCSI target framework), and if he
decides to go with SCST, I'll be glad to offer my help and support, and
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better one
should win.


why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


nod

I don't think anyone is saying it should be.  It makes sense that the
more mature SCSI engines that have working code will be providing a lot
of the foundation as we talk about options.


From comparing the designs of SCST and LIO-SE, we know that SCST
supports very SCSI-specific target mode hardware, including software
target mode forks of other kernel code.  This is code for the target
mode pSCSI, FC and SAS control paths (more for the state machines than
CDB emulation) that will most likely never need to be emulated in a
non-SCSI target engine.


...but it is required for SCSI, so it must be there anyway.


SCST has support for the most SCSI fabric protocols of the group
(although it is lacking iSER), while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
design goal of LIO-SE was to let every iSCSI initiator that sends SCSI
CDBs and data talk to every potential device in the Linux storage stack
on the largest number of hardware architectures possible.

Most of the iSCSI initiators I know (including non-Linux ones) do not
rely on heavy SCSI task management, and I think getting real
SCSI-specific recovery into the traditional iSCSI target would be a
lower priority item for users.  Especially things like SCSI target mode
queue locking (affectionately called Auto Contingent Allegiance) make no
sense for traditional iSCSI or iSER, because the CmdSN rules are doing
this for us.


Sorry, that isn't correct. ACA provides the possibility to lock the
command queue in case of a CHECK CONDITION, and so allows keeping the
command execution order in case of errors. CmdSN keeps the command
execution order only in the success case; in case of an error, the next
queued command will be executed immediately after the failed one,
although the application might require all commands queued after the
failed one to be aborted. Think about journaled file systems, for
instance. ACA also allows retrying the failed command and then resuming
the queue.
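
[Editor's note: a minimal, purely illustrative sketch of the ordering
difference described above. This is not SCST, LIO or iSCSI-SCST code;
all of the types and names below are invented for illustration.]

/* Illustrative only: how a command queue might behave with and without
 * an ACA-style freeze after a failed command. */
#include <stdbool.h>
#include <stddef.h>

struct cmd {
	int (*execute)(struct cmd *c);	/* 0 on GOOD, -1 on CHECK CONDITION */
	struct cmd *next;
};

struct cmd_queue {
	struct cmd *head;
	bool aca_supported;	/* target honours ACA */
	bool aca_active;	/* queue frozen after a failure */
};

/* Process queued commands in order.
 *
 * Without ACA: after a CHECK CONDITION the next queued command still
 * runs, even if the application needed it aborted first -- CmdSN only
 * guarantees ordered delivery of the commands, not what happens after
 * an error.
 *
 * With ACA: the queue freezes, so the initiator can abort or retry
 * what it needs to before anything else is executed. */
static void process_queue(struct cmd_queue *q)
{
	while (q->head && !q->aca_active) {
		struct cmd *c = q->head;

		q->head = c->next;
		if (c->execute(c) != 0 && q->aca_supported)
			q->aca_active = true;	/* frozen until CLEAR ACA */
	}
}

/* Called once the initiator has cleared the ACA condition (e.g. after
 * retrying the failed command): resume the remaining commands. */
static void clear_aca(struct cmd_queue *q)
{
	q->aca_active = false;
	process_queue(q);
}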


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:


Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

  Luben




Hi Luben,

I am guessing you mean further down the stack, which I don't know to be
the case.  Going further up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout.  The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :).  I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects.  Could someone be so kind to fill me in on this..?


SCST uses the scsi_execute_async_fifo() function to submit commands to
SCSI devices in pass-through mode. This function is a slightly modified
version of scsi_execute_async(), which submits requests in FIFO order
instead of the LIFO order that scsi_execute_async() uses (so with
scsi_execute_async() they are executed in the reverse order).
scsi_execute_async_fifo() is added as a separate patch to the kernel.
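
[Editor's note: the patch itself is not reproduced here. The sketch
below only illustrates the FIFO-vs-LIFO distinction being described,
using the standard <linux/list.h> helpers; the structure and function
names are invented for illustration.]

#include <linux/list.h>

struct pending_req {
	struct list_head entry;
	/* ... command payload ... */
};

/* LIFO-style queueing: each new request goes to the head of the list,
 * so a consumer taking requests from the head picks them up in reverse
 * submission order. */
static void queue_req_lifo(struct list_head *queue, struct pending_req *req)
{
	list_add(&req->entry, queue);
}

/* FIFO-style queueing: each new request goes to the tail, preserving
 * the order in which the commands were submitted -- the behaviour the
 * scsi_execute_async_fifo() variant is described as providing. */
static void queue_req_fifo(struct list_head *queue, struct pending_req *req)
{
	list_add_tail(&req->entry, queue);
}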



Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented.  Userspace passthrough in
this context is a target engine I/O path that enforces max_sector and
sector_size limitations, and encodes/decodes target storage protocol
packets all out of view of userspace.  The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.

--nab

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

Since the focus of this thread shifted somewhat in the last few
messages, I'll try to summarize what has been discussed so far:
- There were a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been explained why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have, among other things, shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this?


I tend to agree, with some important notes:

1. IET should be excluded from this list; iSCSI-SCST is IET updated for
the SCST framework, with a lot of bug fixes and improvements.


2. I think everybody will agree that a Linux iSCSI target should work on
top of some standard SCSI target framework. Hence the choice gets
narrower: SCST vs. STGT. I don't think there's a place for a dedicated
iSCSI target (i.e. PyX/LIO) in the mainline, because of the large amount
of code duplication. Nicholas could decide to move to either existing
framework (although, frankly, I don't see how an in-kernel iSCSI target
could sit on top of a user-space SCSI target framework), and if he
decides to go with SCST, I'll be glad to offer my help and support, and
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better one
should win.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Tue, 2008-02-05 at 21:59 +0300, Vladislav Bolkhovitin wrote:


Hmm, how can one write to an mmapped page and not touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, we agreed the mmap() approach is impractical, but could you
elaborate on this anyway, please? I'm just curious. Are you thinking
about implementing a new syscall, which would put pages with data into
the mmap'ed area?


No, it has to do with the way invalidation occurs.  When you mmap a
region from a device or file, the kernel places page translations for
that region into your vm_area.  The regions themselves aren't backed
until faulted.  For write (i.e. incoming command to target) you specify
the write flag and send the area off to receive the data.  The gather,
expecting the pages to be overwritten, backs them with pages marked
dirty but doesn't fault in the contents (unless it already exists in the
page cache).  The kernel writes the data to the pages and the dirty
pages go back to the user.  msync() flushes them to the device.

The disadvantage of all this is that the handle for the I/O if you will
is a virtual address in a user process that doesn't actually care to see
the data. non-x86 architectures will do flushes/invalidates on this
address space as the I/O occurs.
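
[Editor's note: a user-space sketch of the mmap()/msync() write path
being described, for readers unfamiliar with the sequence. The
receive_from_initiator() stub and handle_write() name are invented,
error handling is abbreviated, and the offset is assumed page-aligned;
the stub stands in for the step the kernel would perform on the
process's behalf.]

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t receive_from_initiator(void *buf, size_t len);	/* stub */

static int handle_write(int fd, off_t off, size_t len)
{
	/* Map the region of the backing store that the WRITE will cover
	 * (off is assumed page-aligned for brevity). */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
	int ret;

	if (p == MAP_FAILED)
		return -1;

	/* The incoming data lands directly in the mapped, page cache
	 * backed region. */
	if (receive_from_initiator(p, len) != (ssize_t)len) {
		munmap(p, len);
		return -1;
	}

	/* msync() pushes the dirty pages to the backing store. */
	ret = msync(p, len, MS_SYNC);
	munmap(p, len);
	return ret;
}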


I more or less see, thanks. But (1) the pages still need to be mmapped
into the user space process before the data transmission, i.e. they must
be zeroed before being mmapped, which isn't much faster than a data
copy, and (2) I suspect it would be hard to make it race free, e.g. if
another process wanted to write to the same area simultaneously.



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just shown that there is no good way to
implement zero-copy cached I/O for STGT. I see only one practical way to
do it, proposed by FUJITA Tomonori some time ago: duplicating the Linux
page cache in user space. But would you like that?


Well, there's no real evidence that zero copy or lack of it is a problem
yet.


The performance improvement from zero copy can be easily estimated,
knowing the link throughput and the data copy throughput, which are
about the same for 20 Gbps links (I did that a few e-mails ago).
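
[Editor's note: a back-of-the-envelope version of that estimate. It
assumes the extra data copy is serialized with the wire transfer and
that memcpy() bandwidth is roughly equal to the link bandwidth, as
stated above; real numbers obviously depend on the hardware.]

#include <stdio.h>

int main(void)
{
	double link_gbps = 20.0;	/* wire speed of the link */
	double copy_gbps = 20.0;	/* memcpy() bandwidth, assumed comparable */

	/* One-copy path: every byte crosses the wire and is also copied
	 * once, so the serialized throughput is the harmonic combination
	 * of the two rates. */
	double one_copy = 1.0 / (1.0 / link_gbps + 1.0 / copy_gbps);

	/* Zero-copy path is limited by the link alone. */
	double zero_copy = link_gbps;

	printf("one copy : %.1f Gbps\n", one_copy);	/* ~10 Gbps */
	printf("zero copy: %.1f Gbps\n", zero_copy);	/* ~20 Gbps */
	return 0;
}

Under these assumptions removing the copy roughly doubles the
achievable throughput, which is why the effect only shows up on links
fast enough to be comparable to memory copy speed.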


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Jeff Garzik wrote:
iSCSI is way, way too complicated. 


I fully agree. On the one hand, all that complexity is unavoidable for
the case of multiple connections per session, but for the regular case
of one connection per session it should be a lot simpler.


Actually, think about those multiple connections...  we already had to 
implement fast-failover (and load bal) SCSI multi-pathing at a higher 
level.  IMO that portion of the protocol is redundant:   You need the 
same capability elsewhere in the OS _anyway_, if you are to support 
multi-pathing.


I'm thinking about MC/S as a way to improve performance using several
physical links. There's no way other than MC/S to keep the command
processing order in that case. So it's a really valuable property of
iSCSI, although with limited application.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Erez Zilber wrote:

Bart Van Assche wrote:


As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
pieces of software are necessary to make networked storage possible:
initiator software and target software. As far as I know there exist
three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
InfiniBand network. See also the following thread for the details:
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com_name=scst-devel.

 


Sorry for the late response (but better late than never).

One may claim that STGT should have lower performance than SCST because
its data path is from userspace. However, your results show that for
non-IB transports, they both show the same numbers. Furthermore, with IB
there shouldn't be any additional difference between the 2 targets
because data transfer from userspace is as efficient as data transfer
from kernel space.


And now consider what happens if one target has zero-copy cached I/O.
How much will that improve its performance?



The only explanation that I see is that fine tuning for iSCSI & iSER is
required. As was already mentioned in this thread, with SDR you can get
~900 MB/sec with iSER (on STGT).

Erez

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Jeff Garzik wrote:

Alan Cox wrote:

better. So for example, I personally suspect that ATA-over-ethernet is way 
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
low-level, and against those crazy SCSI people to begin with.


Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.



AoE is truly a thing of beauty.  It has a two/three page RFC (say no more!).

But quite so...  AoE is limited to MTU size, which really hurts.  Can't 
really do tagged queueing, etc.



iSCSI is way, way too complicated. 


I fully agree. On the one hand, all that complexity is unavoidable for
the case of multiple connections per session, but for the regular case
of one connection per session it should be a lot simpler.


And now think about iSER, which brings iSCSI to a whole new complexity
level ;)


It's an Internet protocol designed 
by storage designers, what do you expect?


For years I have been hoping that someone will invent a simple protocol 
(w/ strong auth) that can transit ATA and SCSI commands and responses. 
Heck, it would be almost trivial if the kernel had a TLS/SSL implementation.


Jeff

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Linus Torvalds wrote:

I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem



.. not even shared. It was hard to get correct semantics full stop. 

Which is a traditional problem. The thing is, the kernel always has some 
internal state, and it's hard to expose all the semantics that the kernel 
knows about to user space.


So no, performance is not the only reason to move to kernel space. It can 
easily be things like needing direct access to internal data queues (for a 
iSCSI target, this could be things like barriers or just tagged commands - 
yes, you can probably emulate things like that without access to the 
actual IO queues, but are you sure the semantics will be entirely right?


The kernel/userland boundary is not just a performance boundary, it's an 
abstraction boundary too, and these kinds of protocols tend to break 
abstractions. NFS broke it by having "file handles" (which is not 
something that really exists in user space, and is almost impossible to 
emulate correctly), and I bet the same thing happens when emulating a SCSI 
target in user space.


Yes, there is something like that for SCSI target as well. It's a "local 
initiator" or "local nexus", see 
http://thread.gmane.org/gmane.linux.scsi/31288 and 
http://news.gmane.org/find-root.php?message_id=%3c463F36AC.3010207%40vlnb.net%3e 
for more info about that.


In fact, the existence of the local nexus is one more point in favor of
SCST over STGT, because for STGT it's pretty hard to support (all
locally generated commands would have to be passed through its daemon,
which would be a total disaster for performance), while for SCST it can
be done relatively simply.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Linus Torvalds wrote:
So just going by what has happened in the past, I'd assume that iSCSI 
would eventually turn into "connecting/authentication in user space" with 
"data transfers in kernel space".


This is exactly how iSCSI-SCST (the iSCSI target driver for SCST) is
implemented; credit to the IET and Ardis target developers.


Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Mon, 2008-02-04 at 21:38 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:


On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:



James Bottomley wrote:



On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:




James Bottomley wrote:



So, James, what is your opinion on the above? Or does the overall SCSI
target project's simplicity not matter much to you, and you think it's
fine to duplicate the Linux page cache in user space to keep the
in-kernel part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi=120164008302435

and here:

http://marc.info/?l=linux-scsi=120171067107293

Weren't they?


No, sorry, it doesn't look that way to me. Those are about performance,
but I'm asking about the overall project's architecture, namely about
one part of it: simplicity. In particular, what do you think about
duplicating the Linux page cache in user space to get zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
within STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel=120169189504361=2. This is one of 
the advantages of the kernel space implementation. The user space 
implementation has to have data copied between the cache and a user
space buffer, but the kernel space one can use pages in the cache
directly, without an extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, you actually have to do an msync on the region to get the
data secured to the backing store.


James, have you checked how fast mmapped I/O is if the working set is
larger than RAM? It's several times slower compared to buffered I/O.
This has been discussed many times on LKML and, it seems, the VM people
consider it unavoidable.
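
[Editor's note: a crude way to reproduce that comparison on a file
larger than RAM; the program, mode names and chunk size are invented
for illustration and this is not the benchmark referred to above. Time
each mode separately, dropping caches in between.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (1 << 20)	/* 1 MB, arbitrary */

int main(int argc, char **argv)
{
	int fd;
	struct stat st;
	unsigned long sum = 0;

	if (argc != 3) {
		fprintf(stderr, "usage: %s read|mmap <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[2], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[2]);
		return 1;
	}

	if (strcmp(argv[1], "read") == 0) {
		/* Buffered path: read() copies the data out of the page
		 * cache; touching one byte keeps the loop from being
		 * optimized away. */
		static char buf[CHUNK];
		ssize_t n;

		while ((n = read(fd, buf, CHUNK)) > 0)
			sum += (unsigned char)buf[0];
	} else {
		/* mmap path: map the whole file (assumes a 64-bit box)
		 * and fault in every page by touching one byte of it. */
		long pg = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		off_t i;

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		for (i = 0; i < st.st_size; i += pg)
			sum += (unsigned char)p[i];
		munmap(p, st.st_size);
	}
	printf("%lu\n", sum);
	close(fd);
	return 0;
}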



Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.


James, just check and you will see that buffered I/O is a lot faster.


So in an out of memory situation the buffers you don't have are a lot
faster than the pages I don't have?


There isn't an OOM condition in either case. It's just that page
reclamation and readahead work much better in the buffered case.


So, using mmapped I/O isn't an option for high performance. Also,
mmapped I/O isn't an option where there are high reliability
requirements, since it doesn't provide a practical way to handle I/O
errors.


I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region. 


Err, returned to whom? If you try to read from an mmapped page which
can't be populated due to an I/O error, you will get SIGBUS or SIGSEGV,
I don't remember exactly which. It's quite tricky to get back to the
faulted command from the signal handler.


Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you
think that such mapping/unmapping is good for performance?
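
[Editor's note: a sketch of the signal-handler gymnastics being
referred to -- recovering from an I/O error on a mapped page via SIGBUS
and sigsetjmp(), so the faulted command can be failed cleanly. The
copy_from_mapping() name is invented; this is not code from any of the
targets discussed, and it is not thread-safe as written.]

#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf fault_env;

static void sigbus_handler(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/* Copy 'len' bytes out of a mapped file region; returns -1 if the
 * backing store reported an I/O error for any page in the range. */
static int copy_from_mapping(void *dst, const void *src, size_t len)
{
	struct sigaction sa = { 0 }, old;
	int ret = 0;

	sa.sa_handler = sigbus_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(fault_env, 1) == 0)
		memcpy(dst, src, len);	/* may raise SIGBUS on an I/O error */
	else
		ret = -1;		/* back here; fail the command cleanly */

	sigaction(SIGBUS, &old, NULL);
	return ret;
}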




msync does something
similar if there's a write failure.



You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more specific about what kind of tricks need to be done for
that?


Actually, just avoiding touching it seems to do the trick with a recent
kernel.


Hmm, how can one write to an mmapped page and not touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, we agreed the mmap() approach is impractical, but could you
elaborate on this anyway, please? I'm just curious. Are you thinking
about implementing a new syscall, which would put pages with data into
the mmap'ed area?



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just shown that there is no good way to
implement zero-copy cached I/O for STGT. I see only one practical way to
do it, proposed by FUJITA Tomonori some time ago: duplicating the Linux
page cache in user space. But would you like that?



There's no actual evidence that copy problems are causing any
performance issues for STGT.  In fact, there's evidence that they're
not, for everything except IB networks.


Zero-copy cached I/O has not yet been implemented in SCST; I simply
haven't had time for that so far. Currently SCST performs better than
STGT because of a simpler processing path and fewer context switches per
command. Memcpy() speed on modern systems is about th
