Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 02:34:52PM +, Mel Gorman wrote:
 On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
  On 01/22/2014 04:34 AM, Mel Gorman wrote:
  On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
  One topic that has been lurking forever at the edges is the current
  4k limitation for file system block sizes. Some devices in
  production today and others coming soon have larger sectors and it
  would be interesting to see if it is time to poke at this topic
  again.
  
  Large block support was proposed years ago by Christoph Lameter
  (http://lwn.net/Articles/232757/). I think I was just getting started
  in the community at the time so I do not recall any of the details. I do
  believe it motivated an alternative by Nick Piggin called fsblock though
  (http://lwn.net/Articles/321390/). At the very least it would be nice to
  know why neither was ever merged, for those of us who were not around
  at the time and who may not have the chance to dive through mailing list
  archives between now and March.
  
  FWIW, I would expect that a show-stopper for any proposal is requiring
  high-order allocations to succeed for the system to behave correctly.
  
  
  I have a somewhat hazy memory of Andrew warning us that touching
  this code takes us into dark and scary places.
  
 
 That is a light summary. As Andrew tends to reject patches with poor
 documentation in case we forget the details in 6 months, I'm going to guess
 that he does not remember the details of a discussion from 7ish years ago.
 This is where Andrew swoops in with a dazzling display of his eidetic
 memory just to prove me wrong.
 
 Ric, are there any storage vendors pushing for this right now?
 Is someone working on this right now or planning to? If they are, have they
 looked into the history of fsblock (Nick) and large block support (Christoph)
 to see if they are candidates for forward porting or reimplementation?
 I ask because without that person there is a risk that the discussion
 will go as follows
 
 Topic leader: Does anyone have an objection to supporting larger block
   sizes than the page size?
 Room: Send patches and we'll talk.

So, from someone who was down in the trenches of the large
filesystem block size code wars: the main objection to Christoph
Lameter's patchset was that it used high-order compound pages in the
page cache so that nothing at the filesystem level needed to be changed
to support large block sizes.

The patch to enable XFS to use 64k block sizes with Christoph's
patches was simply removing 5 lines of code that limited the block
size to PAGE_SIZE. And everything just worked.
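
For illustration, the kind of mount-time guard being removed looks roughly like the standalone sketch below (the function name, message and return convention are invented here; this is not the actual XFS code):

#include <stdio.h>

/*
 * Hedged sketch: a stand-in for the sort of "block size must not exceed
 * page size" check described above.  Purely illustrative.
 */
static int validate_blocksize(unsigned int fs_blocksize, unsigned long page_size)
{
	if (fs_blocksize > page_size) {
		fprintf(stderr,
			"block size %u > page size %lu: not supported\n",
			fs_blocksize, page_size);
		return -1;	/* the mount would be refused here */
	}
	return 0;
}

int main(void)
{
	/* 4k blocks on 4k pages pass; 64k blocks on 4k pages are rejected. */
	printf("4k/4k  -> %d\n", validate_blocksize(4096, 4096));
	printf("64k/4k -> %d\n", validate_blocksize(65536, 4096));
	return 0;
}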

Given that compound pages are used all over the place now and we
also have page migration, compaction and other MM support that
greatly improves high order memory allocation, perhaps we should
revisit this approach.

As to Nick's fsblock rewrite, he basically rewrote all the
bufferhead code to handle filesystem blocks larger than a page
whilst leaving the page cache untouched, i.e. the complete opposite
approach. The problem with this approach is that every filesystem
needs to be re-written to use fsblocks rather than bufferheads. For
some filesystems that isn't hard (e.g. ext2) but for filesystems
that use bufferheads in the core of their journalling subsystems
that's a completely different story.

And for filesystems like XFS, it doesn't solve any of the problems
with using bufferheads that we have now, so it simply introduces a
huge amount of IO path rework and validation without providing any
advantage from a feature or performance point of view. i.e. extent
based filesystems mostly negate the impact of filesystem block size
on IO performance...

Realistically, if I'm going to do something in XFS to add block size >
page size support, I'm going to do it with something XFS can track
through its own journal so I can add data=journal functionality
with the same filesystem block/extent header structures used to
track the pages in blocks larger than PAGE_SIZE. And given that we
already have such infrastructure in XFS to support directory
blocks larger than the filesystem block size...

FWIW, as to the original large sector size support question, XFS
already supports sector sizes up to 32k. The limitation is
actually a limitation of the journal format, so going larger than
that would take some work...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
   On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
  
  [ I like big sectors and I cannot lie ]
 
 I think I might be sceptical, but I don't think that's showing in my
 concerns ...
 
I really think that if we want to make progress on this one, we need
code and someone that owns it.  Nick's work was impressive, but it was
mostly there for getting rid of buffer heads.  If we have a device that
needs it and someone working to enable that device, we'll go forward
much faster.
   
   Do we even need to do that (eliminate buffer heads)?  We cope with 4k
   sector only devices just fine today because the bh mechanisms now
   operate on top of the page cache and can do the RMW necessary to update
   a bh in the page cache itself which allows us to do only 4k chunked
   writes, so we could keep the bh system and just alter the granularity of
   the page cache.
   
  
  We're likely to have people mixing 4K drives and fill in some other
  size here on the same box.  We could just go with the biggest size and
  use the existing bh code for the sub-pagesized blocks, but I really
  hesitate to change VM fundamentals for this.
 
 If the page cache had a variable granularity per device, that would cope
 with this.  It's the variable granularity that's the VM problem.
 
  From a pure code point of view, it may be less work to change it once in
  the VM.  But from an overall system impact point of view, it's a big
  change in how the system behaves just for filesystem metadata.
 
 Agreed, but only if we don't do RMW in the buffer cache ... which may be
 a good reason to keep it.
 
   The other question is if the drive does RMW between 4k and whatever its
   physical sector size, do we need to do anything to take advantage of
   it ... as in what would altering the granularity of the page cache buy
   us?
  
  The real benefit is when and how the reads get scheduled.  We're able to
  do a much better job pipelining the reads, controlling our caches and
  reducing write latency by having the reads done up in the OS instead of
  the drive.
 
 I agree with all of that, but my question is still can we do this by
 propagating alignment and chunk size information (i.e. the physical
 sector size) like we do today.  If the FS knows the optimal I/O patterns
 and tries to follow them, the odd cockup won't impact performance
 dramatically.  The real question is can the FS make use of this layout
 information *without* changing the page cache granularity?  Only if you
 answer me no to this do I think we need to worry about changing page
 cache granularity.

We already do this today.

The problem is that we are limited by the page cache assumption that
the block device/filesystem never need to manage multiple pages as
an atomic unit of change. Hence we can't use the generic
infrastructure as it stands to handle block/sector sizes larger than
a page size...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 09:21:40AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 15:19 +, Mel Gorman wrote:
   On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote:
On 01/22/2014 09:34 AM, Mel Gorman wrote:
On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
On 01/22/2014 04:34 AM, Mel Gorman wrote:
On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.

Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. 
I do
believe it motivated an alternative by Nick Piggin called fsblock 
though
(http://lwn.net/Articles/321390/). At the very least it would be 
nice to
know why neither were never merged for those of us that were not 
around
at the time and who may not have the chance to dive through mailing 
list
archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is 
requiring
high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching
this code takes us into dark and scary places.

That is a light summary. As Andrew tends to reject patches with poor
documentation in case we forget the details in 6 months, I'm going to 
guess
that he does not remember the details of a discussion from 7ish years 
ago.
This is where Andrew swoops in with a dazzling display of his eidetic
memory just to prove me wrong.

Ric, are there any storage vendor that is pushing for this right now?
Is someone working on this right now or planning to? If they are, have 
they
looked into the history of fsblock (Nick) and large block support 
(Christoph)
to see if they are candidates for forward porting or reimplementation?
I ask because without that person there is a risk that the discussion
will go as follows

Topic leader: Does anyone have an objection to supporting larger block
   sizes than the page size?
Room: Send patches and we'll talk.


I will have to see if I can get a storage vendor to make a public
statement, but there are vendors hoping to see this land in Linux in
the next few years.
   
   What about the second and third questions -- is someone working on this
   right now or planning to? Have they looked into the history of fsblock
   (Nick) and large block support (Christoph) to see if they are candidates
   for forward porting or reimplementation?
  
  I really think that if we want to make progress on this one, we need
  code and someone that owns it.  Nick's work was impressive, but it was
  mostly there for getting rid of buffer heads.  If we have a device that
  needs it and someone working to enable that device, we'll go forward
  much faster.
 
 Do we even need to do that (eliminate buffer heads)?

No, the reason bufferheads were replaced was that a bufferhead can
only reference a single page, i.e. the structure is that a page can
reference multiple bufferheads (block size <= page size) but a
bufferhead can't reference multiple pages, which is what is needed for
block size > page size. fsblock was designed to handle both cases.
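
For context, a minimal standalone sketch of the structural relationship described above (simplified stand-in types, loosely modelled on the kernel's buffer_head; not the real definitions):

#include <stddef.h>

/* Illustrative stand-ins only, not the kernel structures. */
struct page;

struct buffer_head {
	struct buffer_head *b_this_page;	/* circular list of buffers in one page */
	struct page        *b_page;		/* the single page this buffer maps */
	unsigned long long  b_blocknr;		/* block number on the device */
	size_t              b_size;		/* block size, at most one page */
};

struct page {
	/*
	 * One page can carry several bufferheads when block size <= page size
	 * (e.g. four 1k buffers in a 4k page), but nothing here lets a single
	 * bufferhead span multiple pages, which is the block size > page size
	 * limitation being described.
	 */
	struct buffer_head *private;		/* head of the buffer list */
};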

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 11:50:02AM -0800, Andrew Morton wrote:
 On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
 
  But this, I think, is the fundamental point for debate.  If we can pull
  alignment and other tricks to solve 99% of the problem is there a need
  for radical VM surgery?  Is there anything coming down the pipe in the
  future that may move the devices ahead of the tricks?
 
 I expect it would be relatively simple to get large blocksizes working
 on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
 amounts of work, perhaps someone can do a proof-of-concept on powerpc
 (or ia64) with 64k blocksize.

Reality check: 64k block sizes on 64k page Linux machines have been
used in production on XFS for at least 10 years. It's exactly the
same case as 4k block size on 4k page size - one page, one buffer
head, one filesystem block.
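
A small standalone illustration of why those two cases are structurally identical while 64k blocks on 4k pages are not (plain arithmetic, not kernel code):

#include <stdio.h>

/* How many pages does one filesystem block span, and vice versa? */
static void show(unsigned int block_size, unsigned int page_size)
{
	if (block_size <= page_size)
		printf("%6uB block / %6uB page: %u block(s) per page, 1 page per block\n",
		       block_size, page_size, page_size / block_size);
	else
		printf("%6uB block / %6uB page: 1 block spans %u pages\n",
		       block_size, page_size, block_size / page_size);
}

int main(void)
{
	show(4096, 4096);	/* the common case: one page, one bh, one block */
	show(65536, 65536);	/* 64k-on-64k: structurally identical */
	show(65536, 4096);	/* the hard case: one block is 16 pages */
	return 0;
}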

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH] scsi-sd: removed unused SD_PASSTHROUGH_RETRIES

2014-01-23 Thread Sha Zhengju
From: Sha Zhengju handai@taobao.com

Signed-off-by: Sha Zhengju handai@taobao.com
---
 drivers/scsi/sd.h |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 26895ff..3bbe4df 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -24,7 +24,6 @@
  * Number of allowed retries
  */
 #define SD_MAX_RETRIES 5
-#define SD_PASSTHROUGH_RETRIES 1
 #define SD_MAX_MEDIUM_TIMEOUTS 2
 
 /*
-- 
1.7.9.5



[PATCH] isci: update version to 1.2

2014-01-23 Thread Lukasz Dorau
The version of isci driver has not been updated for 2 years.
It was 83 isci commits ago. Suspend/resume support has been implemented
and many bugs have been fixed since 1.1. Now update the version to 1.2.

Signed-off-by: Lukasz Dorau lukasz.do...@intel.com
Cc: sta...@vger.kernel.org
---
 drivers/scsi/isci/init.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
index d25d0d8..695b34e 100644
--- a/drivers/scsi/isci/init.c
+++ b/drivers/scsi/isci/init.c
@@ -66,7 +66,7 @@
 #include "probe_roms.h"
 
 #define MAJ 1
-#define MIN 1
+#define MIN 2
 #define BUILD 0
 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) "." \
	__stringify(BUILD)



RE: [PATCH] isci: update version to 1.2

2014-01-23 Thread Dorau, Lukasz
On Thursday, January 23, 2014 10:39 AM Lukasz Dorau lukasz.do...@intel.com 
wrote:
 The version of isci driver has not been updated for 2 years.
 It was 83 isci commits ago. Suspend/resume support has been implemented
 and many bugs have been fixed since 1.1. Now update the version to 1.2.
 
 Signed-off-by: Lukasz Dorau lukasz.do...@intel.com
 Cc: sta...@vger.kernel.org

Oops... By mistake I have sent the wrong version of the patch. I'm sorry.
Please disregard it.

Lukasz
 


[PATCH] isci: update version to 1.2

2014-01-23 Thread Lukasz Dorau
The version of isci driver has not been updated for 2 years.
It was 83 isci commits ago. Suspend/resume support has been implemented
and many bugs have been fixed since 1.1. Now update the version to 1.2.

Signed-off-by: Lukasz Dorau lukasz.do...@intel.com
Signed-off-by: Dave Jiang dave.ji...@intel.com
Signed-off-by: Maciej Patelczyk maciej.patelc...@intel.com
---
 drivers/scsi/isci/init.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
index d25d0d8..695b34e 100644
--- a/drivers/scsi/isci/init.c
+++ b/drivers/scsi/isci/init.c
@@ -66,7 +66,7 @@
 #include "probe_roms.h"
 
 #define MAJ 1
-#define MIN 1
+#define MIN 2
 #define BUILD 0
 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) "." \
	__stringify(BUILD)



Re: [usb-storage] Re: usb disk recognized but fails

2014-01-23 Thread Milan Svoboda
Whoaa!!

I recompiled the master again, but now with a slightly modified
configuration, mainly I disabled CONFIG_USB_STORAGE_CYPRESS_ATACB, and
it works like a charm! The disk is properly and immediately detected and works!

I also tried to boot the standard kernel and disable loading of ums_cypress by
putting it on the blacklist, but that didn't work out. The
disk wasn't detected at all (no message about the plug-in event nor a report of
the disk size).

I strongly believe that it is a Linux kernel problem, not the disk's (apart from
the fact that it might need some quirks). If I remember correctly, ums_cypress
hasn't been there from the beginning, right? So perhaps the time when it was
added corresponds to the last time it worked for me.

Best regards and thanks for all your help and wish for a quick fix in the 
mainstream,
Milan Svoboda

--- .config.old 2014-01-23 12:57:17.831854511 +0100
+++ .config 2014-01-23 10:13:20.899234729 +0100
@@ -1,6 +1,6 @@
 #
 # Automatically generated file; DO NOT EDIT.
-# Linux/x86 3.12.8-1 Kernel Configuration
+# Linux/x86 3.13.0 Kernel Configuration
 #
 CONFIG_64BIT=y
 CONFIG_X86_64=y
@@ -39,7 +39,6 @@ CONFIG_HAVE_INTEL_TXT=y
 CONFIG_X86_64_SMP=y
 CONFIG_X86_HT=y
 CONFIG_ARCH_HWEIGHT_CFLAGS=-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11
-CONFIG_ARCH_CPU_PROBE_RELEASE=y
 CONFIG_ARCH_SUPPORTS_UPROBES=y
 CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
 CONFIG_IRQ_WORK=y
@@ -76,7 +75,6 @@ CONFIG_AUDIT=y
 CONFIG_AUDITSYSCALL=y
 CONFIG_AUDIT_WATCH=y
 CONFIG_AUDIT_TREE=y
-CONFIG_AUDIT_LOGINUID_IMMUTABLE=y
 
 #
 # IRQ subsystem
@@ -143,6 +141,7 @@ CONFIG_IKCONFIG_PROC=y
 CONFIG_LOG_BUF_SHIFT=19
 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
 CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
+CONFIG_ARCH_SUPPORTS_INT128=y
 CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE=y
 CONFIG_ARCH_USES_NUMA_PROT_NONE=y
 CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
@@ -167,7 +166,7 @@ CONFIG_CFS_BANDWIDTH=y
 CONFIG_RT_GROUP_SCHED=y
 CONFIG_BLK_CGROUP=y
 # CONFIG_DEBUG_BLK_CGROUP is not set
-CONFIG_CHECKPOINT_RESTORE=y
+# CONFIG_CHECKPOINT_RESTORE is not set
 CONFIG_NAMESPACES=y
 CONFIG_UTS_NS=y
 CONFIG_IPC_NS=y
@@ -247,7 +246,6 @@ CONFIG_HAVE_OPTPROBES=y
 CONFIG_HAVE_KPROBES_ON_FTRACE=y
 CONFIG_HAVE_ARCH_TRACEHOOK=y
 CONFIG_HAVE_DMA_ATTRS=y
-CONFIG_USE_GENERIC_SMP_HELPERS=y
 CONFIG_GENERIC_SMP_IDLE_THREAD=y
 CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
 CONFIG_HAVE_DMA_API_DEBUG=y
@@ -266,11 +264,18 @@ CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSIO
 CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
 CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
 CONFIG_SECCOMP_FILTER=y
+CONFIG_HAVE_CC_STACKPROTECTOR=y
+# CONFIG_CC_STACKPROTECTOR is not set
+CONFIG_CC_STACKPROTECTOR_NONE=y
+# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
+# CONFIG_CC_STACKPROTECTOR_STRONG is not set
 CONFIG_HAVE_CONTEXT_TRACKING=y
+CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
 CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
 CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
 CONFIG_HAVE_ARCH_SOFT_DIRTY=y
 CONFIG_MODULES_USE_ELF_RELA=y
+CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
 CONFIG_OLD_SIGSUSPEND3=y
 CONFIG_COMPAT_OLD_SIGACTION=y
 
@@ -282,6 +287,7 @@ CONFIG_COMPAT_OLD_SIGACTION=y
 CONFIG_SLABINFO=y
 CONFIG_RT_MUTEXES=y
 CONFIG_BASE_SMALL=0
+# CONFIG_SYSTEM_TRUSTED_KEYRING is not set
 CONFIG_MODULES=y
 CONFIG_MODULE_FORCE_LOAD=y
 CONFIG_MODULE_UNLOAD=y
@@ -453,6 +459,7 @@ CONFIG_MEMORY_HOTPLUG_SPARSE=y
 CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
+CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
 CONFIG_BALLOON_COMPACTION=y
 CONFIG_COMPACTION=y
 CONFIG_MIGRATION=y
@@ -475,7 +482,6 @@ CONFIG_FRONTSWAP=y
 # CONFIG_CMA is not set
 CONFIG_ZBUD=y
 CONFIG_ZSWAP=y
-CONFIG_MEM_SOFT_DIRTY=y
 CONFIG_X86_CHECK_BIOS_CORRUPTION=y
 CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
 CONFIG_X86_RESERVE_LOW=64
@@ -490,7 +496,6 @@ CONFIG_X86_SMAP=y
 CONFIG_EFI=y
 CONFIG_EFI_STUB=y
 CONFIG_SECCOMP=y
-CONFIG_CC_STACKPROTECTOR=y
 # CONFIG_HZ_100 is not set
 # CONFIG_HZ_250 is not set
 CONFIG_HZ_300=y
@@ -533,13 +538,13 @@ CONFIG_PM_DEBUG=y
 CONFIG_PM_ADVANCED_DEBUG=y
 # CONFIG_PM_TEST_SUSPEND is not set
 CONFIG_PM_SLEEP_DEBUG=y
+# CONFIG_DPM_WATCHDOG is not set
 CONFIG_PM_TRACE=y
 CONFIG_PM_TRACE_RTC=y
 # CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
 CONFIG_ACPI=y
 CONFIG_ACPI_SLEEP=y
 # CONFIG_ACPI_PROCFS is not set
-# CONFIG_ACPI_PROCFS_POWER is not set
 CONFIG_ACPI_EC_DEBUGFS=m
 CONFIG_ACPI_AC=m
 CONFIG_ACPI_BATTERY=m
@@ -555,7 +560,6 @@ CONFIG_ACPI_THERMAL=m
 CONFIG_ACPI_NUMA=y
 # CONFIG_ACPI_CUSTOM_DSDT is not set
 CONFIG_ACPI_INITRD_TABLE_OVERRIDE=y
-CONFIG_ACPI_BLACKLIST_YEAR=0
 # CONFIG_ACPI_DEBUG is not set
 CONFIG_ACPI_PCI_SLOT=y
 CONFIG_X86_PM_TIMER=y
@@ -571,13 +575,13 @@ CONFIG_ACPI_APEI_PCIEAER=y
 CONFIG_ACPI_APEI_MEMORY_FAILURE=y
 CONFIG_ACPI_APEI_EINJ=m
 CONFIG_ACPI_APEI_ERST_DEBUG=m
+# CONFIG_ACPI_EXTLOG is not set
 CONFIG_SFI=y
 
 #
 # CPU Frequency scaling
 #
 CONFIG_CPU_FREQ=y
-CONFIG_CPU_FREQ_TABLE=y
 

Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Theodore Ts'o
On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:
  
  I expect it would be relatively simple to get large blocksizes working
  on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
  amounts of work, perhaps someone can do a proof-of-concept on powerpc
  (or ia64) with 64k blocksize.
 
 Reality check: 64k block sizes on 64k page Linux machines has been
 used in production on XFS for at least 10 years. It's exactly the
 same case as 4k block size on 4k page size - one page, one buffer
 head, one filesystem block.

This is true for ext4 as well.  Block size == page size support is
pretty easy; the hard part is when block size > page size, due to
assumptions in the VM layer that require the FS to do a
lot of extra work to fudge around.  So the real problem comes with
trying to support 64k block sizes on a 4k page architecture, and can
we do it in a way where every single file system doesn't have to do
their own specific hacks to work around assumptions made in the VM
layer.

Some of the problems include handling the case where someone
dirties a single 4k page in a sparse block, and the FS needs to manually
fault in the other 60k of the block around that single page.  Or the VM not
understanding that page eviction needs to be done in chunks of 64k so
we don't have part of the block evicted but not all of it, etc.
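
To make the read-modify-write cost concrete, here is a small standalone userspace sketch that updates 4k inside a 64k block the way a filesystem would have to if the whole block were the unit of IO (the file name and helper are invented for the demo; this is not ext4 or kernel code):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE (64 * 1024)	/* the filesystem/device block */
#define DIRTY_SIZE (4 * 1024)	/* what the caller actually dirtied */

/* Update DIRTY_SIZE bytes at 'pos' by RMW-ing the whole surrounding block. */
static int rmw_update(int fd, off_t pos, const char *dirty)
{
	char block[BLOCK_SIZE];
	off_t block_start = pos - (pos % BLOCK_SIZE);

	/* Read: pull in the other 60k that nobody touched. */
	if (pread(fd, block, BLOCK_SIZE, block_start) != BLOCK_SIZE)
		return -1;
	/* Modify: copy the dirty 4k into place. */
	memcpy(block + (pos - block_start), dirty, DIRTY_SIZE);
	/* Write: this model only accepts whole 64k blocks. */
	if (pwrite(fd, block, BLOCK_SIZE, block_start) != BLOCK_SIZE)
		return -1;
	return 0;
}

int main(void)
{
	char dirty[DIRTY_SIZE];
	int fd = open("rmw-demo.dat", O_RDWR | O_CREAT, 0644);	/* demo file */

	if (fd < 0 || ftruncate(fd, BLOCK_SIZE) < 0)
		return 1;
	memset(dirty, 'x', sizeof(dirty));
	if (rmw_update(fd, 8192, dirty))	/* dirty one 4k page at offset 8k */
		return 1;
	close(fd);
	puts("updated 4k of data but moved 128k to do it");
	return 0;
}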

- Ted


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread James Bottomley
On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
 On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
  On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
   On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
   
   [ I like big sectors and I cannot lie ]
  
  I think I might be sceptical, but I don't think that's showing in my
  concerns ...
  
 I really think that if we want to make progress on this one, we need
 code and someone that owns it.  Nick's work was impressive, but it was
 mostly there for getting rid of buffer heads.  If we have a device 
 that
 needs it and someone working to enable that device, we'll go forward
 much faster.

Do we even need to do that (eliminate buffer heads)?  We cope with 4k
sector only devices just fine today because the bh mechanisms now
operate on top of the page cache and can do the RMW necessary to update
a bh in the page cache itself which allows us to do only 4k chunked
writes, so we could keep the bh system and just alter the granularity of
the page cache.

   
   We're likely to have people mixing 4K drives and fill in some other
   size here on the same box.  We could just go with the biggest size and
   use the existing bh code for the sub-pagesized blocks, but I really
   hesitate to change VM fundamentals for this.
  
  If the page cache had a variable granularity per device, that would cope
  with this.  It's the variable granularity that's the VM problem.
  
   From a pure code point of view, it may be less work to change it once in
   the VM.  But from an overall system impact point of view, it's a big
   change in how the system behaves just for filesystem metadata.
  
  Agreed, but only if we don't do RMW in the buffer cache ... which may be
  a good reason to keep it.
  
The other question is if the drive does RMW between 4k and whatever its
physical sector size, do we need to do anything to take advantage of
it ... as in what would altering the granularity of the page cache buy
us?
   
   The real benefit is when and how the reads get scheduled.  We're able to
   do a much better job pipelining the reads, controlling our caches and
   reducing write latency by having the reads done up in the OS instead of
   the drive.
  
  I agree with all of that, but my question is still can we do this by
  propagating alignment and chunk size information (i.e. the physical
  sector size) like we do today.  If the FS knows the optimal I/O patterns
  and tries to follow them, the odd cockup won't impact performance
  dramatically.  The real question is can the FS make use of this layout
  information *without* changing the page cache granularity?  Only if you
  answer me no to this do I think we need to worry about changing page
  cache granularity.
 
 We already do this today.
 
 The problem is that we are limited by the page cache assumption that
 the block device/filesystem never need to manage multiple pages as
 an atomic unit of change. Hence we can't use the generic
 infrastructure as it stands to handle block/sector sizes larger than
 a page size...

If the compound page infrastructure exists today and is usable for this,
what else do we need to do? ... because if it's a couple of trivial
changes and a few minor patches to filesystems to take advantage of it,
we might as well do it anyway.  I was only objecting on the grounds that
the last time we looked at it, it was major VM surgery.  Can someone
give a summary of how far we are away from being able to do this with
the VM system today and what extra work is needed (and how big is this
piece of work)?

James




Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Mel Gorman
On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
 On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
  On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
   On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:

[ I like big sectors and I cannot lie ]
   
   I think I might be sceptical, but I don't think that's showing in my
   concerns ...
   
  I really think that if we want to make progress on this one, we need
  code and someone that owns it.  Nick's work was impressive, but it 
  was
  mostly there for getting rid of buffer heads.  If we have a device 
  that
  needs it and someone working to enable that device, we'll go forward
  much faster.
 
 Do we even need to do that (eliminate buffer heads)?  We cope with 4k
 sector only devices just fine today because the bh mechanisms now
 operate on top of the page cache and can do the RMW necessary to 
 update
 a bh in the page cache itself which allows us to do only 4k chunked
 writes, so we could keep the bh system and just alter the granularity 
 of
 the page cache.
 

We're likely to have people mixing 4K drives and fill in some other
size here on the same box.  We could just go with the biggest size and
use the existing bh code for the sub-pagesized blocks, but I really
hesitate to change VM fundamentals for this.
   
   If the page cache had a variable granularity per device, that would cope
   with this.  It's the variable granularity that's the VM problem.
   
From a pure code point of view, it may be less work to change it once in
the VM.  But from an overall system impact point of view, it's a big
change in how the system behaves just for filesystem metadata.
   
   Agreed, but only if we don't do RMW in the buffer cache ... which may be
   a good reason to keep it.
   
 The other question is if the drive does RMW between 4k and whatever 
 its
 physical sector size, do we need to do anything to take advantage of
 it ... as in what would altering the granularity of the page cache buy
 us?

The real benefit is when and how the reads get scheduled.  We're able to
do a much better job pipelining the reads, controlling our caches and
reducing write latency by having the reads done up in the OS instead of
the drive.
   
   I agree with all of that, but my question is still can we do this by
   propagating alignment and chunk size information (i.e. the physical
   sector size) like we do today.  If the FS knows the optimal I/O patterns
   and tries to follow them, the odd cockup won't impact performance
   dramatically.  The real question is can the FS make use of this layout
   information *without* changing the page cache granularity?  Only if you
   answer me no to this do I think we need to worry about changing page
   cache granularity.
  
  We already do this today.
  
  The problem is that we are limited by the page cache assumption that
  the block device/filesystem never need to manage multiple pages as
  an atomic unit of change. Hence we can't use the generic
  infrastructure as it stands to handle block/sector sizes larger than
  a page size...
 
 If the compound page infrastructure exists today and is usable for this,
 what else do we need to do? ... because if it's a couple of trivial
 changes and a few minor patches to filesystems to take advantage of it,
 we might as well do it anyway. 

Do not do this as there is no guarantee that a compound allocation will
succeed. If the allocation fails then it is potentially unrecoverable:
if we can no longer write to storage then you're hosed. If you are
now thinking mempool then the problem becomes that the system will be
in a state of degraded performance for an unknowable length of time and
may never recover fully. 64K MMU page size systems get away with this
because the blocksize is still <= PAGE_SIZE and no core VM changes are
necessary. Critically, pages like the page table pages are the same size as
the basic unit of allocation used by the kernel so external fragmentation
simply is not a severe problem.
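
For reference, a hedged kernel-style sketch of the failure mode being described (alloc_pages() and the GFP flags are the real allocator interface; the surrounding function and its fallback policy are invented for illustration):

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Illustrative only: try to grab a 64k (order-4) compound unit for the page
 * cache.  The function name and fallback policy are assumptions, not
 * existing kernel code.
 */
static struct page *grab_block_pages(unsigned int order)
{
	struct page *page;

	/* Opportunistic: don't retry hard, don't warn; failure is expected. */
	page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN, order);
	if (page)
		return page;

	/*
	 * Under fragmentation the order-4 request can fail even with plenty
	 * of free memory, which is the point above: any design built on it
	 * must either tolerate ENOMEM here or fall back to order-0 pages.
	 */
	return NULL;
}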

 I was only objecting on the grounds that
 the last time we looked at it, it was major VM surgery.  Can someone
 give a summary of how far we are away from being able to do this with
 the VM system today and what extra work is needed (and how big is this
 piece of work)?
 

Offhand, no idea. For fsblock, probably a similar amount of work as
had to be done in 2007, and I'd expect it would still run into the filesystem
awareness problems that Dave Chinner pointed out earlier. For large block,
it'd hit the same wall that allocations must always succeed. If we
want to break the connection between the basic unit of memory managed
by the kernel and the MMU page size then I don't know, but it would be a
fairly large amount of surgery and need a lot of design work.

Re: [usb-storage] Re: usb disk recognized but fails

2014-01-23 Thread Alan Stern
On Thu, 23 Jan 2014, Milan Svoboda wrote:

 Whoaa!!
 
 I recompiled the master again, but now with a little bit modified 
 configuration, mainly I disabled the CONFIG_USB_STORAGE_CYPRESS_ATACB and
 it works like a charm! Disk is properly and immediately detected and works!

I don't see how that could have made any difference.  The Cypress-ATACB 
driver works just like the default driver, except for two commands
(ATA(12) and ATA(16)) neither of which appeared in the usbmon trace.

Your new config enables CONFIG_USB_STORAGE_DEBUG.  More likely that is 
the reason for the improvement.  Try taking out that one setting (don't 
change anything else) and see what happens.

 I also tried to boot to standard kernel and disable loading ums_cypress by 
 putting it on the blacklist but it didn't worked out. The
 disk wasn't detected at all (no message about plug-in event nor report about 
 disk size).
 
 I strongly belive that it is Linux kernel problem, not the disk's (apart it 
 might need some quirks). If I remember correctly there hasn't
 been the ums_cypress from the begining, right? So, perhaps the time when it 
 was added corresponds with the time when it
 worked for me last time.

What do you mean by "from the beginning"?  The ums-cypress driver was
added in 2008.

Alan Stern



[PATCH] sym53c8xx_2: Set DID_REQUEUE return code when aborting squeue.

2014-01-23 Thread Mikulas Patocka
When the controller encounters an error (including QUEUE FULL or BUSY 
status), it aborts all not yet submitted requests in the function 
sym_dequeue_from_squeue.

This function aborts them with DID_SOFT_ERROR.

If the disk has a full tag queue, the request that caused the overflow is 
aborted with QUEUE FULL status (and the scsi midlayer properly retries it 
until it is accepted by the disk), but other requests are aborted with 
DID_SOFT_ERROR --- for them, the midlayer does just a few retries and then 
signals the error up to sd.

The result is that disk returning QUEUE FULL causes request failures.

The error was reproduced on a 53c895 with a COMPAQ BD03685A24 disk (rebranded 
ST336607LC) with a command queue of 48 or 64 tags. The disk has 64 tags, but 
under some access patterns it returns QUEUE FULL when there are fewer than 
64 pending tags. The SCSI specification allows returning QUEUE FULL at any 
time and it is up to the host to retry.

Signed-off-by: Mikulas Patocka mpato...@redhat.com
Cc: sta...@vger.kernel.org

---
 drivers/scsi/sym53c8xx_2/sym_hipd.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c
===
--- linux-2.6.36-rc5-fast.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c  
2010-09-27 10:25:59.0 +0200
+++ linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c   2010-09-27 
10:26:27.0 +0200
@@ -3000,7 +3000,11 @@ sym_dequeue_from_squeue(struct sym_hcb *
 		if ((target == -1 || cp->target == target) &&
 		    (lun    == -1 || cp->lun    == lun)    &&
 		    (task   == -1 || cp->tag    == task)) {
+#ifdef SYM_OPT_HANDLE_DEVICE_QUEUEING
 			sym_set_cam_status(cp->cmd, DID_SOFT_ERROR);
+#else
+			sym_set_cam_status(cp->cmd, DID_REQUEUE);
+#endif
 			sym_remque(&cp->link_ccbq);
 			sym_insque_tail(&cp->link_ccbq, &np->comp_ccbq);
 		}


Re: Persistent reservation behaviour/compliance with redundant controllers

2014-01-23 Thread Lee Duncan
On 01/07/2014 12:18 PM, Pasi Kärkkäinen wrote:
 On Mon, Jan 06, 2014 at 11:53:44PM +0100, Matthias Eble wrote:

 I have a persistent reservations for dummies document I wrote that I
 can send you off list, if you like.

 I think I know how PRs work. Yet I'd be happy about your document.

 
 I think that document could be helpful for others aswell, so please post it 
 to the list :)
 
 Thanks!
 
 -- Pasi
 


Apologies for taking so darn long to reply!

I have published my SCSI-3 Document here:

  http://www.gonzoleeman.net/documents/scsi-3-pgr-tutorial-v1.0

Feedback welcome.
-- 
Lee Duncan
SUSE Labs


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote:
 On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:
   
   I expect it would be relatively simple to get large blocksizes working
   on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
   amounts of work, perhaps someone can do a proof-of-concept on powerpc
   (or ia64) with 64k blocksize.
  
  Reality check: 64k block sizes on 64k page Linux machines has been
  used in production on XFS for at least 10 years. It's exactly the
  same case as 4k block size on 4k page size - one page, one buffer
  head, one filesystem block.
 
 This is true for ext4 as well.  Block size == page size support is
 pretty easy; the hard part is when block size > page size, due to
 assumptions in the VM layer that requires that FS system needs to do a
 lot of extra work to fudge around.  So the real problem comes with
 trying to support 64k block sizes on a 4k page architecture, and can
 we do it in a way where every single file system doesn't have to do
 their own specific hacks to work around assumptions made in the VM
 layer.
 
 Some of the problems include handling the case where you get someone
 dirties a single block in a sparse page, and the FS needs to manually
 fault in the other 56k pages around that single page.  Or the VM not
 understanding that page eviction needs to be done in chunks of 64k so
 we don't have part of the block evicted but not all of it, etc.

Right, this is part of the problem that fsblock tried to handle, and
some of the nastiness it had was that a page fault only resulted in
the individual page being read from the underlying block. This means
that it was entirely possible that the filesystem would need to do
RMW cycles in the writeback path itself to handle things like block
checksums, copy-on-write, unwritten extent conversion, etc. i.e. all
the stuff that the page cache currently handles by doing RMW cycles
at the page level.

The method of using compound pages in the page cache, so that the
page cache could do 64k RMW cycles and a filesystem never had to
deal with new issues like the above, was one of the reasons that
approach is so appealing to us filesystem people. ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread James Bottomley
On Thu, 2014-01-23 at 16:44 +, Mel Gorman wrote:
 On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
  On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
   On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
 On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
  On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
 
 [ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my
concerns ...

   I really think that if we want to make progress on this one, we 
   need
   code and someone that owns it.  Nick's work was impressive, but 
   it was
   mostly there for getting rid of buffer heads.  If we have a 
   device that
   needs it and someone working to enable that device, we'll go 
   forward
   much faster.
  
  Do we even need to do that (eliminate buffer heads)?  We cope with 
  4k
  sector only devices just fine today because the bh mechanisms now
  operate on top of the page cache and can do the RMW necessary to 
  update
  a bh in the page cache itself which allows us to do only 4k chunked
  writes, so we could keep the bh system and just alter the 
  granularity of
  the page cache.
  
 
 We're likely to have people mixing 4K drives and fill in some other
 size here on the same box.  We could just go with the biggest size 
 and
 use the existing bh code for the sub-pagesized blocks, but I really
 hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope
with this.  It's the variable granularity that's the VM problem.

 From a pure code point of view, it may be less work to change it once 
 in
 the VM.  But from an overall system impact point of view, it's a big
 change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be
a good reason to keep it.

  The other question is if the drive does RMW between 4k and whatever 
  its
  physical sector size, do we need to do anything to take advantage of
  it ... as in what would altering the granularity of the page cache 
  buy
  us?
 
 The real benefit is when and how the reads get scheduled.  We're able 
 to
 do a much better job pipelining the reads, controlling our caches and
 reducing write latency by having the reads done up in the OS instead 
 of
 the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me no to this do I think we need to worry about changing page
cache granularity.
   
   We already do this today.
   
   The problem is that we are limited by the page cache assumption that
   the block device/filesystem never need to manage multiple pages as
   an atomic unit of change. Hence we can't use the generic
   infrastructure as it stands to handle block/sector sizes larger than
   a page size...
  
  If the compound page infrastructure exists today and is usable for this,
  what else do we need to do? ... because if it's a couple of trivial
  changes and a few minor patches to filesystems to take advantage of it,
  we might as well do it anyway. 
 
 Do not do this as there is no guarantee that a compound allocation will
 succeed.

I presume this is because in the current implementation compound pages
have to be physically contiguous.  For increasing granularity in the
page cache, we don't necessarily need this ... however, getting write
out to work properly without physically contiguous pages would be a bit
more challenging (but not impossible) to solve.

  If the allocation fails then it is potentially unrecoverable
 because we can no longer write to storage then you're hosed. If you are
 now thinking mempool then the problem becomes that the system will be
 in a state of degraded performance for an unknowable length of time and
 may never recover fully. 64K MMU page size systems get away with this
 because the blocksize is still <= PAGE_SIZE and no core VM changes are
 necessary. Critically, pages like the page table pages are the same size as
 the basic unit of allocation used by the kernel so external fragmentation
 simply is not a severe problem.

Right, I understand this ... but we still need to wonder about what it
would take.  Even the simple "fail a compound page allocation" gets
treated in the kernel the same 

Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Thu, Jan 23, 2014 at 04:44:38PM +, Mel Gorman wrote:
 On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
  On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
   On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
  The other question is if the drive does RMW between 4k and whatever 
  its
  physical sector size, do we need to do anything to take advantage of
  it ... as in what would altering the granularity of the page cache 
  buy
  us?
 
 The real benefit is when and how the reads get scheduled.  We're able 
 to
 do a much better job pipelining the reads, controlling our caches and
 reducing write latency by having the reads done up in the OS instead 
 of
 the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me no to this do I think we need to worry about changing page
cache granularity.
   
   We already do this today.
   
   The problem is that we are limited by the page cache assumption that
   the block device/filesystem never need to manage multiple pages as
   an atomic unit of change. Hence we can't use the generic
   infrastructure as it stands to handle block/sector sizes larger than
   a page size...
  
  If the compound page infrastructure exists today and is usable for this,
  what else do we need to do? ... because if it's a couple of trivial
  changes and a few minor patches to filesystems to take advantage of it,
  we might as well do it anyway. 
 
 Do not do this as there is no guarantee that a compound allocation will
 succeed. If the allocation fails then it is potentially unrecoverable
 because we can no longer write to storage then you're hosed.  If you are
 now thinking mempool then the problem becomes that the system will be
 in a state of degraded performance for an unknowable length of time and
 may never recover fully.

We are talking about page cache allocation here, not something deep
down inside the IO path that requires mempools to guarantee IO
completion. IOWs, we have an *existing error path* to return ENOMEM
to userspace when page cache allocation fails.

 64K MMU page size systems get away with this
 because the blocksize is still <= PAGE_SIZE and no core VM changes are
 necessary. Critically, pages like the page table pages are the same size as
 the basic unit of allocation used by the kernel so external fragmentation
 simply is not a severe problem.

Christoph's old patches didn't need 64k MMU page sizes to work.
IIRC, the compound page was mapped into the page cache as
individual 4k pages. Any change of state on the child pages followed
the back pointer to the head of the compound page and changed the
state of that page. On page faults, the individual 4k pages were
mapped to userspace rather than the compound page, so there was no
userspace visible change, either.
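
A minimal standalone sketch of that head/child bookkeeping (the types are invented stand-ins, not the kernel's compound-page implementation):

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_BLOCK 16	/* 64k filesystem block on 4k pages */

struct cache_unit;

struct child_page {
	struct cache_unit *head;	/* back pointer to the compound head */
	bool mapped_to_user;		/* each 4k child can be mapped alone */
};

struct cache_unit {
	bool dirty;			/* state is tracked once, at the head */
	struct child_page child[PAGES_PER_BLOCK];
};

/* Any state change on a child is redirected to the head, as described above. */
static void set_child_dirty(struct child_page *p)
{
	p->head->dirty = true;
}

int main(void)
{
	struct cache_unit unit = { .dirty = false };

	for (int i = 0; i < PAGES_PER_BLOCK; i++)
		unit.child[i].head = &unit;

	set_child_dirty(&unit.child[3]);	/* fault + dirty one 4k child */
	printf("unit dirty: %s\n", unit.dirty ? "yes" : "no");
	return 0;
}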

The question I had at the time that was never answered was this: if
pages are faulted and mapped individually through their own ptes,
why did the compound pages need to be contiguous? copy-in/out
through read/write was still done a PAGE_SIZE granularity, mmap
mappings were still on PAGE_SIZE granularity, so why can't we build
a compound page for the page cache out of discontiguous pages?

FWIW, XFS has long used discontiguous pages for large block support
in metadata. Some of that is vmapped to make metadata processing
simple. The point of this is that we don't need *contiguous*
compound pages in the page cache if we can map them into userspace
as individual PAGE_SIZE pages. Only the page cache management needs
to handle the groups of pages that make up a filesystem block
as a compound page
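
As a concrete illustration of the discontiguous approach, a hedged kernel-style sketch of building one 64k mapping from scattered 4k pages (alloc_page(), vmap() and __free_page() are real kernel interfaces; the helper itself is invented and is not the XFS buffer code):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

#define PAGES_PER_BLOCK 16	/* 64k block from 4k pages */

/*
 * Illustrative sketch only: build one virtually contiguous 64k mapping out
 * of physically discontiguous order-0 pages, so no high-order allocation
 * is ever required.
 */
static void *map_discontig_block(struct page **pages)
{
	void *addr;
	int i;

	for (i = 0; i < PAGES_PER_BLOCK; i++) {
		pages[i] = alloc_page(GFP_KERNEL);	/* order-0 allocations only */
		if (!pages[i])
			goto fail;
	}

	/* vmap() provides one linear kernel address over the scattered pages. */
	addr = vmap(pages, PAGES_PER_BLOCK, VM_MAP, PAGE_KERNEL);
	if (addr)
		return addr;
fail:
	while (i--)
		__free_page(pages[i]);
	return NULL;
}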

  I was only objecting on the grounds that
  the last time we looked at it, it was major VM surgery.  Can someone
  give a summary of how far we are away from being able to do this with
  the VM system today and what extra work is needed (and how big is this
  piece of work)?
  
 
 Offhand no idea. For fsblock, probably a similar amount of work than
 had to be done in 2007 and I'd expect it would still require filesystem
 awareness problems that Dave Chinner pointer out earlier. For large block,
 it'd hit into the same wall that allocations must always succeed. If we
 want to break the connection between the basic unit of memory managed
 by the kernel and the MMU page size then I don't know but it would be a
 fairly large amount of surgery and need a lot of design work.

Here's the patch that Christoph wrote back in 2007 to add PAGE_SIZE
based mmap 

Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Christoph Lameter
On Wed, 22 Jan 2014, Mel Gorman wrote:

 Don't get me wrong, I'm interested in the topic but I severely doubt I'd
 have the capacity to research the background of this in advance. It's also
 unlikely that I'd work on it in the future without throwing out my current
 TODO list. In an ideal world someone will have done the legwork in advance
 of LSF/MM to help drive the topic.

I can give an overview of the history and the challenges of the approaches
if needed.



Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Christoph Lameter
On Wed, 22 Jan 2014, Mel Gorman wrote:

 Large block support was proposed years ago by Christoph Lameter
 (http://lwn.net/Articles/232757/). I think I was just getting started
 in the community at the time so I do not recall any of the details. I do
 believe it motivated an alternative by Nick Piggin called fsblock though
 (http://lwn.net/Articles/321390/). At the very least it would be nice to
 know why neither were never merged for those of us that were not around
 at the time and who may not have the chance to dive through mailing list
 archives between now and March.

It was rejected first because of the necessity of higher order page
allocations. Nick and I then added ways to virtually map higher order
pages if the page allocator could no longer provide those.

All of this required changes to the basic page cache operations. I added a
way for the mapping to indicate an order for an address range and then
modified the page cache operations to be able to operate on any order
pages.
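
For illustration, a standalone sketch of what a per-mapping order means for the index/offset arithmetic the page cache operations have to do (the names here are invented for the sketch, not the actual patchset API):

#include <stdio.h>

#define PAGE_SHIFT 12UL				/* 4k base pages */

/* Hypothetical per-mapping geometry: the cached unit is (1 << order) pages. */
struct mapping_geom {
	unsigned int order;
};

static unsigned long unit_shift(const struct mapping_geom *m)
{
	return PAGE_SHIFT + m->order;
}

/* Which cache unit holds byte 'pos', and where inside that unit it falls. */
static unsigned long long cache_index(const struct mapping_geom *m,
				      unsigned long long pos)
{
	return pos >> unit_shift(m);
}

static unsigned long long cache_offset(const struct mapping_geom *m,
				       unsigned long long pos)
{
	return pos & ((1ULL << unit_shift(m)) - 1);
}

int main(void)
{
	struct mapping_geom small = { .order = 0 };	/* 4k units */
	struct mapping_geom large = { .order = 4 };	/* 64k units */
	unsigned long long pos = 200 * 1024 + 123;

	printf("order 0: index %llu, offset %llu\n",
	       cache_index(&small, pos), cache_offset(&small, pos));
	printf("order 4: index %llu, offset %llu\n",
	       cache_index(&large, pos), cache_offset(&large, pos));
	return 0;
}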

The patchset that introduced the ability to specify different orders for
the pagecache address ranges was not accepted by Andrew because he thought
there was no chance for the rest of the modifications to become
acceptable.



Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Christoph Lameter
On Thu, 23 Jan 2014, James Bottomley wrote:

 If the compound page infrastructure exists today and is usable for this,
 what else do we need to do? ... because if it's a couple of trivial
 changes and a few minor patches to filesystems to take advantage of it,
 we might as well do it anyway.  I was only objecting on the grounds that
 the last time we looked at it, it was major VM surgery.  Can someone
 give a summary of how far we are away from being able to do this with
 the VM system today and what extra work is needed (and how big is this
 piece of work)?

The main problem for me was the page cache. The VM would not be such a
problem. Changing the page cache functions required updates to many
filesystems.




Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Joel Becker
On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote:
 On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:
   
   I expect it would be relatively simple to get large blocksizes working
   on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
   amounts of work, perhaps someone can do a proof-of-concept on powerpc
   (or ia64) with 64k blocksize.
  
  Reality check: 64k block sizes on 64k page Linux machines has been
  used in production on XFS for at least 10 years. It's exactly the
  same case as 4k block size on 4k page size - one page, one buffer
  head, one filesystem block.
 
 This is true for ext4 as well.  Block size == page size support is
 pretty easy; the hard part is when block size > page size, due to
 assumptions in the VM layer that requires that FS system needs to do a
 lot of extra work to fudge around.  So the real problem comes with
 trying to support 64k block sizes on a 4k page architecture, and can
 we do it in a way where every single file system doesn't have to do
 their own specific hacks to work around assumptions made in the VM
 layer.

Yup, ditto for ocfs2.

Joel

-- 

One of the symptoms of an approaching nervous breakdown is the
 belief that one's work is terribly important.
 - Bertrand Russell 

http://www.jlbec.org/
jl...@evilplan.org


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Joel Becker
On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 18:37 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
   On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
 [agreement cut because it's boring for the reader]
   Realistically, if you look at what the I/O schedulers output on a
   standard (spinning rust) workload, it's mostly large transfers.
   Obviously these are misalgned at the ends, but we can fix some of that
   in the scheduler.  Particularly if the FS helps us with layout.  My
   instinct tells me that we can fix 99% of this with layout on the FS + io
   schedulers ... the remaining 1% goes to the drive as needing to do RMW
   in the device, but the net impact to our throughput shouldn't be that
   great.
  
  There are a few workloads where the VM and the FS would team up to make
  this fairly miserable
  
  Small files.  Delayed allocation fixes a lot of this, but the VM doesn't
  realize that fileA, fileB, fileC, and fileD all need to be written at
  the same time to avoid RMW.  Btrfs and MD have setup plugging callbacks
  to accumulate full stripes as much as possible, but it still hurts.
  
  Metadata.  These writes are very latency sensitive and we'll gain a lot
  if the FS is explicitly trying to build full sector IOs.
 
 OK, so these two cases I buy ... the question is can we do something
 about them today without increasing the block size?
 
 The metadata problem, in particular, might be block independent: we
 still have a lot of small chunks to write out at fractured locations.
 With a large block size, the FS knows it's been bad and can expect the
 rolled up newspaper, but it's not clear what it could do about it.
 
 The small files issue looks like something we should be tackling today
 since writing out adjacent files would actually help us get bigger
 transfers.

ocfs2 can actually take significant advantage here, because we store
small file data in-inode.  This would grow our in-inode size from ~3K to
~15K or ~63K.  We'd actually have to do more work to start putting more
than one inode in a block (though that would be a promising avenue too
once the coordination is solved generically).

Joel


-- 

One of the symptoms of an approaching nervous breakdown is the
 belief that one's work is terribly important.
 - Bertrand Russell 

http://www.jlbec.org/
jl...@evilplan.org


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Chris Mason
On Thu, 2014-01-23 at 13:27 -0800, Joel Becker wrote:
 On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote:
  On Wed, 2014-01-22 at 18:37 +, Chris Mason wrote:
   On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
  [agreement cut because it's boring for the reader]
Realistically, if you look at what the I/O schedulers output on a
standard (spinning rust) workload, it's mostly large transfers.
Obviously these are misaligned at the ends, but we can fix some of that
in the scheduler.  Particularly if the FS helps us with layout.  My
instinct tells me that we can fix 99% of this with layout on the FS + io
schedulers ... the remaining 1% goes to the drive as needing to do RMW
in the device, but the net impact to our throughput shouldn't be that
great.
   
   There are a few workloads where the VM and the FS would team up to make
   this fairly miserable
   
   Small files.  Delayed allocation fixes a lot of this, but the VM doesn't
   realize that fileA, fileB, fileC, and fileD all need to be written at
   the same time to avoid RMW.  Btrfs and MD have setup plugging callbacks
   to accumulate full stripes as much as possible, but it still hurts.
   
   Metadata.  These writes are very latency sensitive and we'll gain a lot
   if the FS is explicitly trying to build full sector IOs.
  
  OK, so these two cases I buy ... the question is can we do something
  about them today without increasing the block size?
  
  The metadata problem, in particular, might be block independent: we
  still have a lot of small chunks to write out at fractured locations.
  With a large block size, the FS knows it's been bad and can expect the
  rolled up newspaper, but it's not clear what it could do about it.
  
  The small files issue looks like something we should be tackling today
  since writing out adjacent files would actually help us get bigger
  transfers.
 
 ocfs2 can actually take significant advantage here, because we store
 small file data in-inode.  This would grow our in-inode size from ~3K to
 ~15K or ~63K.  We'd actually have to do more work to start putting more
 than one inode in a block (though that would be a promising avenue too
 once the coordination is solved generically).

Btrfs already defaults to 16K metadata and can go as high as 64k.  The
part we don't do is multi-page sectors for data blocks.

I'd tend to leverage the read/modify/write engine from the raid code for
that.
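
A back-of-the-envelope sketch of that read/modify/write idea for a
sub-block data write.  read_block() and write_block() are hypothetical
helpers standing in for the real raid56 machinery, so treat this as the
shape of the operation rather than actual btrfs/MD code:

#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/string.h>

static int rmw_sub_block_write(struct block_device *bdev, sector_t block,
			       unsigned int block_size, unsigned int offset,
			       const void *data, unsigned int len)
{
	void *buf = kmalloc(block_size, GFP_NOFS);
	int ret;

	if (!buf)
		return -ENOMEM;

	/* Read the whole FS block, patch the small range, write it all back. */
	ret = read_block(bdev, block, buf, block_size);		/* hypothetical helper */
	if (!ret) {
		memcpy(buf + offset, data, len);
		ret = write_block(bdev, block, buf, block_size);	/* hypothetical helper */
	}
	kfree(buf);
	return ret;
}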

-chris



[PATCH 4/5] ia64 simscsi: fix race condition and simplify the code

2014-01-23 Thread Mikulas Patocka
The simscsi driver processes the requests in the request routine and then
offloads the completion callback to a tasklet. This is buggy because there
is parallel unsynchronized access to the completion queue from the request
routine and from the tasklet.

With the current SCSI architecture, requests can be completed directly from
the request routine, so I removed the tasklet code.

Signed-off-by: Mikulas Patocka mpato...@redhat.com

---
 arch/ia64/hp/sim/simscsi.c |   34 ++
 1 file changed, 2 insertions(+), 32 deletions(-)

Index: linux-2.6-ia64/arch/ia64/hp/sim/simscsi.c
===================================================================
--- linux-2.6-ia64.orig/arch/ia64/hp/sim/simscsi.c	2014-01-24 01:23:08.0 +0100
+++ linux-2.6-ia64/arch/ia64/hp/sim/simscsi.c	2014-01-24 01:26:16.0 +0100
@@ -47,9 +47,6 @@
 
 static struct Scsi_Host *host;
 
-static void simscsi_interrupt (unsigned long val);
-static DECLARE_TASKLET(simscsi_tasklet, simscsi_interrupt, 0);
-
 struct disk_req {
 	unsigned long addr;
 	unsigned len;
@@ -64,13 +61,6 @@ static int desc[16] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
 };
 
-static struct queue_entry {
-	struct scsi_cmnd *sc;
-} queue[SIMSCSI_REQ_QUEUE_LEN];
-
-static int rd, wr;
-static atomic_t num_reqs = ATOMIC_INIT(0);
-
 /* base name for default disks */
 static char *simscsi_root = DEFAULT_SIMSCSI_ROOT;
 
@@ -95,21 +85,6 @@ simscsi_setup (char *s)
 
 __setup("simscsi=", simscsi_setup);
 
-static void
-simscsi_interrupt (unsigned long val)
-{
-	struct scsi_cmnd *sc;
-
-	while ((sc = queue[rd].sc) != NULL) {
-		atomic_dec(&num_reqs);
-		queue[rd].sc = NULL;
-		if (DBG)
-			printk("simscsi_interrupt: done with %ld\n",
-			       sc->serial_number);
-		(*sc->scsi_done)(sc);
-		rd = (rd + 1) % SIMSCSI_REQ_QUEUE_LEN;
-	}
-}
-
 static int
 simscsi_biosparam (struct scsi_device *sdev, struct block_device *n,
 		sector_t capacity, int ip[])
@@ -315,14 +290,9 @@ simscsi_queuecommand_lck (struct scsi_cm
 		sc->sense_buffer[0] = 0x70;
 		sc->sense_buffer[2] = 0x00;
 	}
-	if (atomic_read(&num_reqs) >= SIMSCSI_REQ_QUEUE_LEN) {
-		panic("Attempt to queue command while command is pending!!");
-	}
-	atomic_inc(&num_reqs);
-	queue[wr].sc = sc;
-	wr = (wr + 1) % SIMSCSI_REQ_QUEUE_LEN;
 
-	tasklet_schedule(&simscsi_tasklet);
+	(*sc->scsi_done)(sc);
+
 	return 0;
 }
 



Re: [LSF/MM ATTEND] interest in blk-mq, scsi-mq, dm-cache, dm-thinp, dm-*

2014-01-23 Thread Mike Christie
On 01/13/2014 05:36 AM, Hannes Reinecke wrote:
 On 01/10/2014 07:27 PM, Mike Snitzer wrote:
 I would like to attend to participate in discussions related to topics
 listed in the subject.  As a maintainer of DM I'd be interested to
 learn/discuss areas that should become a development focus in the months
 following LSF.

 +1
 
 I've been thinking about (re-)implementing multipathing on top of
 blk-mq, and would like to discuss the feasibility of that.
 There are some design decisions in blk-mq (eg statically allocating
 the number of queues) which do not play well with that.
 

I have been thinking about going in a completely different direction.

The thing about dm-multipath is that being request based adds extra queue
locking, and that of course is bad. In our testing it is a major perf
issue. We do get things like io scheduling in return, though.

If we went back to bio based multipathing, it turns out that once
scsi also supports multiqueue it all works pretty nicely. There is
room for improvement in general, like making some dm allocations
numa/cpu aware, but the request_queue locking issues we have go away and
it is very simple code wise.

We could go the route of making request based dm-multipath:

1. aware of underlying multiqueue devices. So just basically keep what
we have more or less, but have dm-multipath make a request that can
be sent to a multiqueue device and then call blk_mq_insert_request. This
would all be hidden behind nice interfaces that hide whether the
underlying device is multiqueue or not.

2. make dm-multipath do multiqueue (so implement map_queue, queue_rq,
etc) and also make it aware of underlying multiqueue devices.

#1 just keeps the existing request spin_lock problem so there is not
much point other than just getting things working.

#2 is a good deal of work, and what does it end up buying us over just
making multipath bio based? We lose iosched support. If we are going to
make advanced multiqueue ioschedulers that rely on request structs then
#2 could be useful.
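
To make the bio based option concrete, the map callback of a bio based
dm target is roughly the shape below.  choose_path() is a made-up stand-in
for the path selector, and the exact dm_target map signature has varied
between kernel versions, so this is a sketch of the idea rather than the
dm-mpath code:

#include <linux/device-mapper.h>

static int mpath_bio_map(struct dm_target *ti, struct bio *bio)
{
	struct block_device *path = choose_path(ti->private);	/* hypothetical selector */

	if (!path)
		return DM_MAPIO_REQUEUE;	/* no usable path right now */

	bio->bi_bdev = path;			/* redirect to the chosen path */
	return DM_MAPIO_REMAPPED;		/* block core resubmits it for us */
}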


Re: [LSF/MM ATTEND] interest in blk-mq, scsi-mq, dm-cache, dm-thinp, dm-*

2014-01-23 Thread Hannes Reinecke
On 01/24/2014 03:37 AM, Mike Christie wrote:
 On 01/13/2014 05:36 AM, Hannes Reinecke wrote:
 On 01/10/2014 07:27 PM, Mike Snitzer wrote:
 I would like to attend to participate in discussions related to topics
 listed in the subject.  As a maintainer of DM I'd be interested to
 learn/discuss areas that should become a development focus in the months
 following LSF.

 +1

 I've been thinking about (re-)implementing multipathing on top of
 blk-mq, and would like to discuss the feasibility of that.
 There are some design decisions in blk-mq (eg statically allocating
 the number of queues) which do not play well with that.

 
 I have been thinking about going in a completely different direction.
 
 The thing about dm-multipath is that being request based adds extra queue
 locking, and that of course is bad. In our testing it is a major perf
 issue. We do get things like io scheduling in return, though.
 
Indeed. And without that we cannot do true load-balancing.

 If we went back to bio based multipathing, it turns out that once
 scsi also supports multiqueue it all works pretty nicely. There is
 room for improvement in general, like making some dm allocations
 numa/cpu aware, but the request_queue locking issues we have go away and
 it is very simple code wise.
 
If and when.

The main issue I see with that is that it might take some time (if
ever) for SCSI LLDDs to go fully multiqueue.
In fact, I strongly suspect that only newer LLDDs will ever support
multiqueue; for the older cards the HW interface is too much tied
to single-queue operations.

 We could go the route of making request based dm-multipath:
 
 1. aware of underlying multiqueue devices. So just basically keep what
 we have more or less, but have dm-multipath make a request that can
 be sent to a multiqueue device and then call blk_mq_insert_request. This
 would all be hidden behind nice interfaces that hide whether the
 underlying device is multiqueue or not.
 
 2. make dm-multipath do multiqueue (so implement map_queue, queue_rq,
 etc) and also make it aware of underlying multiqueue devices.
 
 #1 just keeps the existing request spin_lock problem so there is not
 much point other than just getting things working.
 
 #2 is a good deal of work, and what does it end up buying us over just
 making multipath bio based? We lose iosched support. If we are going to
 make advanced multiqueue ioschedulers that rely on request structs then
 #2 could be useful.
 

Obviously we need iosched support when going multiqueue.
I wouldn't dream of dropping it.

So my overall idea here is to move multipath over to block-mq,
making each path equivalent to one hardware queue.
(As mentioned above, currently every single FC HBA exposes a single
HW queue anyway.)
The ioschedulers would be moved to the map_queue function; a rough
sketch of the queue-per-path idea follows the list below.

This approach has several issues which I would like to discuss:
- block-mq ctx allocation currently is static. This doesn't play
  well with multipathing, where paths (= queues) might get configured
  on-the-fly.
- Queues might be coming from different HBAs; one would need to
  audit the block-mq code to see whether that's possible.
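
Roughly what "one hardware queue per path" would look like.  This is a
conceptual sketch only, since the blk-mq hooks and their signatures were
still in flux at the time; select_path_index() is a made-up placeholder
for the path selection / io scheduling decision:

struct mpath_dev {
	unsigned int nr_paths;		/* == number of hw queues */
	struct mpath_path *paths;	/* one entry per path/queue */
};

/* The map step is where path selection (and io scheduling) would live. */
static unsigned int mpath_map_to_queue(struct mpath_dev *md, int cpu)
{
	unsigned int idx = select_path_index(md, cpu);	/* hypothetical */

	return idx < md->nr_paths ? idx : 0;
}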

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


RE: [PATCH 1/6] megaraid_sas: Do not wait forever

2014-01-23 Thread Desai, Kashyap
Hannes:

We have already worked on the wait_event usage in megasas_issue_blocked_cmd.
That code will be posted by LSI once we receive the test results from the LSI Q/A team.

If you look at the current OCR code in the Linux driver, you will see that we
re-send the IOCTL command.
The MR product does not want IOCTLs to time out, for certain reasons. That is why,
even if the FW has faulted, the driver will do OCR and re-send all outstanding
management commands (IOCTLs fall under management commands).

Just for info, see the snippet below from the OCR code:

	/* Re-fire management commands */
	for (j = 0; j < instance->max_fw_cmds; j++) {
		cmd_fusion = fusion->cmd_list[j];
		if (cmd_fusion->sync_cmd_idx != (u32)ULONG_MAX) {
			cmd_mfi = instance->cmd_list[cmd_fusion->sync_cmd_idx];
			if (cmd_mfi->frame->dcmd.opcode ==
			    MR_DCMD_LD_MAP_GET_INFO) {
				megasas_return_cmd(instance, cmd_mfi);
				megasas_return_cmd_fusion(instance, cmd_fusion);



The current MR driver is not designed to add timeouts on the DCMD and IOCTL paths.
[I added timeouts only for a limited set of DCMDs, which are harmless to continue
after a timeout.]

As of now, you can skip this patch; we will be submitting a patch to fix a
similar issue.
But note that we cannot do a complete wait_event_timeout conversion due to the
day-1 design; we will try to cover wait_event_timeout for some valid cases.
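
For reference, the bounded-wait pattern under discussion, using the same
names as the patch quoted below; this fragment is a sketch of the general
shape, not the code LSI will post.  wait_event_timeout() returns 0 when
the timeout expires with the condition still false, so the caller has to
check the command status (or the return value) rather than assume the
command completed:

	if (!wait_event_timeout(instance->int_cmd_wait_q,
				cmd->cmd_status != ENODATA,
				MEGASAS_INTERNAL_CMD_WAIT_TIME * HZ))
		return -ETIMEDOUT;	/* FW never answered within the window */

	return 0;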

` Kashyap

 -----Original Message-----
 From: Hannes Reinecke [mailto:h...@suse.de]
 Sent: Thursday, January 16, 2014 3:56 PM
 To: James Bottomley
 Cc: linux-scsi@vger.kernel.org; Hannes Reinecke; Desai, Kashyap; Adam
 Radford
 Subject: [PATCH 1/6] megaraid_sas: Do not wait forever
 
 If the firmware is incommunicado for whatever reason the driver will wait
 forever during initialisation, causing all sorts of hangcheck timers to 
 trigger.
 We should rather wait for a defined time, and give up on the command if no
 response was received.
 
 Cc: Kashyap Desai kashyap.de...@lsi.com
 Cc: Adam Radford aradf...@gmail.com
 Signed-off-by: Hannes Reinecke h...@suse.de
 ---
  drivers/scsi/megaraid/megaraid_sas_base.c | 43 ++-
  1 file changed, 25 insertions(+), 18 deletions(-)
 
 diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
 index 3b7ad10..95d4e5c 100644
 --- a/drivers/scsi/megaraid/megaraid_sas_base.c
 +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
 @@ -911,9 +911,11 @@ megasas_issue_blocked_cmd(struct megasas_instance *instance,
 
  	instance->instancet->issue_dcmd(instance, cmd);
 
 -	wait_event(instance->int_cmd_wait_q, cmd->cmd_status != ENODATA);
 +	wait_event_timeout(instance->int_cmd_wait_q,
 +			   cmd->cmd_status != ENODATA,
 +			   MEGASAS_INTERNAL_CMD_WAIT_TIME * HZ);
 
 -	return 0;
 +	return cmd->cmd_status == ENODATA ? -ENODATA : 0;
  }
 
  /**
 @@ -932,11 +934,12 @@ megasas_issue_blocked_abort_cmd(struct megasas_instance *instance,
  {
  	struct megasas_cmd *cmd;
  	struct megasas_abort_frame *abort_fr;
 +	int status;
 
  	cmd = megasas_get_cmd(instance);
 
  	if (!cmd)
 -		return -1;
 +		return -ENOMEM;
 
  	abort_fr = &cmd->frame->abort;
 
 @@ -960,11 +963,14 @@ megasas_issue_blocked_abort_cmd(struct megasas_instance *instance,
  	/*
  	 * Wait for this cmd to complete
  	 */
 -	wait_event(instance->abort_cmd_wait_q, cmd->cmd_status != 0xFF);
 +	wait_event_timeout(instance->abort_cmd_wait_q,
 +			   cmd->cmd_status != 0xFF,
 +			   MEGASAS_INTERNAL_CMD_WAIT_TIME * HZ);
  	cmd->sync_cmd = 0;
 +	status = cmd->cmd_status;
 
  	megasas_return_cmd(instance, cmd);
 -	return 0;
 +	return status == 0xFF ? -ENODATA : 0;
  }
 
  /**
 @@ -3902,6 +3908,7 @@ megasas_get_seq_num(struct megasas_instance *instance,
  	struct megasas_dcmd_frame *dcmd;
  	struct megasas_evt_log_info *el_info;
  	dma_addr_t el_info_h = 0;
 +	int rc;
 
  	cmd = megasas_get_cmd(instance);
 
 @@ -3933,23 +3940,23 @@ megasas_get_seq_num(struct megasas_instance *instance,
  	dcmd->sgl.sge32[0].phys_addr = cpu_to_le32(el_info_h);
  	dcmd->sgl.sge32[0].length = cpu_to_le32(sizeof(struct megasas_evt_log_info));
 
 -	megasas_issue_blocked_cmd(instance, cmd);
 -
 -	/*
 -	 * Copy the data back into callers buffer
 -	 */
 -	eli->newest_seq_num = le32_to_cpu(el_info->newest_seq_num);
 -	eli->oldest_seq_num = le32_to_cpu(el_info->oldest_seq_num);
 -	eli->clear_seq_num = le32_to_cpu(el_info->clear_seq_num);
 -	eli->shutdown_seq_num = le32_to_cpu(el_info->shutdown_seq_num);
 -	eli->boot_seq_num = le32_to_cpu(el_info->boot_seq_num);
 -
 +	rc = megasas_issue_blocked_cmd(instance, cmd);
 +	if (!rc) {
 +		/*
 +		 * Copy the