Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-30 Thread Dave Chinner
On Mon, Jul 28, 2014 at 08:22:22AM -0400, Abhijith Das wrote:
> 
> 
> - Original Message -
> > From: "Dave Chinner" 
> > To: "Zach Brown" 
> > Cc: "Abhijith Das" , linux-kernel@vger.kernel.org, 
> > "linux-fsdevel" ,
> > "cluster-devel" 
> > Sent: Friday, July 25, 2014 7:38:59 PM
> > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead 
> > syscalls
> > 
> > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> > > > Hi all,
> > > > 
> > > > The topic of a readdirplus-like syscall had come up for discussion at
> > > > last year's
> > > > LSF/MM collab summit. I wrote a couple of syscalls with their GFS2
> > > > implementations
> > > > to get at a directory's entries as well as stat() info on the individual
> > > > inodes.
> > > > I'm presenting these patches and some early test results on a 
> > > > single-node
> > > > GFS2
> > > > filesystem.
> > > > 
> > > > 1. dirreadahead() - This patchset is very simple compared to the
> > > > xgetdents() system
> > > > call below and scales very well for large directories in GFS2.
> > > > dirreadahead() is
> > > > designed to be called prior to getdents+stat operations.
> > > 
> > > Hmm.  Have you tried plumbing these read-ahead calls in under the normal
> > > getdents() syscalls?
> > 
> > The issue is not directory block readahead (which some filesystems
> > like XFS already have), but issuing inode readahead during the
> > getdents() syscall.
> > 
> > It's the semi-random, interleaved inode IO that is being optimised
> > here (i.e. queued, ordered, issued, cached), not the directory
> > blocks themselves. As such, why does this need to be done in the
> > kernel?  This can all be done in userspace, and even hidden within
> > the readdir() or ftw()/nftw() implementations themselves so it's OS,
> > kernel and filesystem independent..
> > 
> 
> I don't see how the sorting of the inode reads in disk block order can be
> accomplished in userland without knowing the fs-specific topology.

I didn't say anything about doing "disk block ordering" in
userspace. Disk block ordering can be done by the IO scheduler, and
that's simple enough to trigger by multithreading and dispatching a
few tens of stat() calls at once.

> From my
> observations, I've seen that the performance gain is the most when we can
> order the reads such that seek times are minimized on rotational media.

Yup, which is done by ensuring that we drive deep IO queues rather
than issuing a single IO at a time and waiting for completion before
issuing the next one. This can easily be done from userspace.
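
For example, a rough userspace sketch along those lines - not from any of
these patches; the thread count and the round-robin split are arbitrary -
that puts a deep queue of inode reads in front of the IO scheduler:

#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define NR_THREADS 16
#define MAX_NAMES  100000

static char *names[MAX_NAMES];
static int nr_names;
static const char *dirpath;

static void *stat_worker(void *arg)
{
        long id = (long)arg;
        char path[4096];
        struct stat st;

        /* each thread stats every NR_THREADS-th entry, so many inode
         * reads are in flight at once and the elevator can sort them */
        for (int i = id; i < nr_names; i += NR_THREADS) {
                snprintf(path, sizeof(path), "%s/%s", dirpath, names[i]);
                stat(path, &st);        /* result discarded - pure readahead */
        }
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid[NR_THREADS];
        struct dirent *de;
        DIR *d;

        dirpath = argc > 1 ? argv[1] : ".";
        d = opendir(dirpath);
        if (!d)
                return 1;
        while ((de = readdir(d)) && nr_names < MAX_NAMES)
                names[nr_names++] = strdup(de->d_name);
        closedir(d);

        for (long t = 0; t < NR_THREADS; t++)
                pthread_create(&tid[t], NULL, stat_worker, (void *)t);
        for (long t = 0; t < NR_THREADS; t++)
                pthread_join(tid[t], NULL);

        /* a subsequent getdents+stat pass now mostly hits the inode cache */
        return 0;
}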

> I have not tested my patches against SSDs, but my guess would be that the
> performance impact would be minimal, if any.

Depends. If the overhead of executing readahead is higher than the
time spent waiting for IO completion, then it will reduce performance.
I.e. the faster the underlying storage, the less CPU time we want to
spend on IO. Readahead generally increases CPU time per object that
needs to be retrieved from disk, and so on high-IOPS devices there's a
really good chance we don't want readahead like this at all.

I.e. this is yet another reason directory traversal readahead should
be driven from userspace, so the policy can be easily controlled by
the application and/or user.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-30 Thread Dave Chinner
On Mon, Jul 28, 2014 at 03:21:20PM -0600, Andreas Dilger wrote:
> On Jul 25, 2014, at 6:38 PM, Dave Chinner  wrote:
> > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> >> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> >>> Hi all,
> >>> 
> >>> The topic of a readdirplus-like syscall had come up for discussion at 
> >>> last year's
> >>> LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 
> >>> implementations
> >>> to get at a directory's entries as well as stat() info on the individual 
> >>> inodes.
> >>> I'm presenting these patches and some early test results on a single-node 
> >>> GFS2
> >>> filesystem.
> >>> 
> >>> 1. dirreadahead() - This patchset is very simple compared to the 
> >>> xgetdents() system
> >>> call below and scales very well for large directories in GFS2. 
> >>> dirreadahead() is
> >>> designed to be called prior to getdents+stat operations.
> >> 
> >> Hmm.  Have you tried plumbing these read-ahead calls in under the normal
> >> getdents() syscalls?
> > 
> > The issue is not directory block readahead (which some filesystems
> > like XFS already have), but issuing inode readahead during the
> > getdents() syscall.
> > 
> > It's the semi-random, interleaved inode IO that is being optimised
> > here (i.e. queued, ordered, issued, cached), not the directory
> > blocks themselves.
> 
> Sure.
> 
> > As such, why does this need to be done in the
> > kernel?  This can all be done in userspace, and even hidden within
> > the readdir() or ftw()/nftw() implementations themselves so it's OS,
> > kernel and filesystem independent..
> 
> That assumes sorting by inode number maps to sorting by disk order.
> That isn't always true.

That's true, but it's a fair bet that roughly ascending inode number
ordering is going to be better than random ordering for most
filesystems.

Besides, ordering isn't the real problem - the real problem is the
latency caused by having to do the inode IO synchronously one stat()
at a time. Just multithread the damn thing in userspace so the
stat()s can be done asynchronously and hence be more optimally
ordered by the IO scheduler and completed before the application
blocks on the IO.

It doesn't even need completion synchronisation - the stat()
issued by the application will block until the async stat()
completes the process of bringing the inode into the kernel cache...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-28 Thread Andreas Dilger
On Jul 25, 2014, at 6:38 PM, Dave Chinner  wrote:
> On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
>> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
>>> Hi all,
>>> 
>>> The topic of a readdirplus-like syscall had come up for discussion at last 
>>> year's
>>> LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 
>>> implementations
>>> to get at a directory's entries as well as stat() info on the individual 
>>> inodes.
>>> I'm presenting these patches and some early test results on a single-node 
>>> GFS2
>>> filesystem.
>>> 
>>> 1. dirreadahead() - This patchset is very simple compared to the 
>>> xgetdents() system
>>> call below and scales very well for large directories in GFS2. 
>>> dirreadahead() is
>>> designed to be called prior to getdents+stat operations.
>> 
>> Hmm.  Have you tried plumbing these read-ahead calls in under the normal
>> getdents() syscalls?
> 
> The issue is not directory block readahead (which some filesystems
> like XFS already have), but issuing inode readahead during the
> getdents() syscall.
> 
> It's the semi-random, interleaved inode IO that is being optimised
> here (i.e. queued, ordered, issued, cached), not the directory
> blocks themselves.

Sure.

> As such, why does this need to be done in the
> kernel?  This can all be done in userspace, and even hidden within
> the readdir() or ftw()/nftw() implementations themselves so it's OS,
> kernel and filesystem independent..

That assumes sorting by inode number maps to sorting by disk order.
That isn't always true.

Cheers, Andreas


RE: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-28 Thread Zuckerman, Boris
Two years ago I had that type of functionality implemented for Ibrix. It
included readdir-ahead and lookup-ahead. We did not assume any new
syscalls; we simply detected readdir+-like interest at the VFS level and
pushed a wave of work that populated directory caches and plugged in
dentry cache entries. It improved the performance of NFS readdir+ and SMB
QueryDirectories by more than 4x.

Regards, Boris



> -Original Message-
> From: linux-fsdevel-ow...@vger.kernel.org [mailto:linux-fsdevel-
> ow...@vger.kernel.org] On Behalf Of Abhijith Das
> Sent: Monday, July 28, 2014 8:22 AM
> To: Dave Chinner
> Cc: linux-kernel@vger.kernel.org; linux-fsdevel; cluster-devel
> Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead 
> syscalls
> 
> 
> 
> - Original Message -
> > From: "Dave Chinner" 
> > To: "Zach Brown" 
> > Cc: "Abhijith Das" , linux-kernel@vger.kernel.org,
> > "linux-fsdevel" , "cluster-devel"
> > 
> > Sent: Friday, July 25, 2014 7:38:59 PM
> > Subject: Re: [RFC] readdirplus implementations: xgetdents vs
> > dirreadahead syscalls
> >
> > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> > > > Hi all,
> > > >
> > > > The topic of a readdirplus-like syscall had come up for discussion
> > > > at last year's LSF/MM collab summit. I wrote a couple of syscalls
> > > > with their GFS2 implementations to get at a directory's entries as
> > > > well as stat() info on the individual inodes.
> > > > I'm presenting these patches and some early test results on a
> > > > single-node
> > > > GFS2
> > > > filesystem.
> > > >
> > > > 1. dirreadahead() - This patchset is very simple compared to the
> > > > xgetdents() system
> > > > call below and scales very well for large directories in GFS2.
> > > > dirreadahead() is
> > > > designed to be called prior to getdents+stat operations.
> > >
> > > Hmm.  Have you tried plumbing these read-ahead calls in under the
> > > normal
> > > getdents() syscalls?
> >
> > The issue is not directory block readahead (which some filesystems
> > like XFS already have), but issuing inode readahead during the
> > getdents() syscall.
> >
> > It's the semi-random, interleaved inode IO that is being optimised
> > here (i.e. queued, ordered, issued, cached), not the directory blocks
> > themselves. As such, why does this need to be done in the kernel?
> > This can all be done in userspace, and even hidden within the
> > readdir() or ftw()/nftw() implementations themselves so it's OS, kernel
> > and filesystem independent..
> >
> 
> I don't see how the sorting of the inode reads in disk block order can be 
> accomplished in
> userland without knowing the fs-specific topology. From my observations, I've 
> seen that
> the performance gain is the most when we can order the reads such that seek 
> times are
> minimized on rotational media.
> 
> I have not tested my patches against SSDs, but my guess would be that the
> performance impact would be minimal, if any.
> 
> Cheers!
> --Abhi


Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-28 Thread Abhijith Das


- Original Message -
> From: "Dave Chinner" 
> To: "Zach Brown" 
> Cc: "Abhijith Das" , linux-kernel@vger.kernel.org, 
> "linux-fsdevel" ,
> "cluster-devel" 
> Sent: Friday, July 25, 2014 7:38:59 PM
> Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead 
> syscalls
> 
> On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> > > Hi all,
> > > 
> > > The topic of a readdirplus-like syscall had come up for discussion at
> > > last year's
> > > LSF/MM collab summit. I wrote a couple of syscalls with their GFS2
> > > implementations
> > > to get at a directory's entries as well as stat() info on the individual
> > > inodes.
> > > I'm presenting these patches and some early test results on a single-node
> > > GFS2
> > > filesystem.
> > > 
> > > 1. dirreadahead() - This patchset is very simple compared to the
> > > xgetdents() system
> > > call below and scales very well for large directories in GFS2.
> > > dirreadahead() is
> > > designed to be called prior to getdents+stat operations.
> > 
> > Hmm.  Have you tried plumbing these read-ahead calls in under the normal
> > getdents() syscalls?
> 
> The issue is not directory block readahead (which some filesystems
> like XFS already have), but issuing inode readahead during the
> getdents() syscall.
> 
> It's the semi-random, interleaved inode IO that is being optimised
> here (i.e. queued, ordered, issued, cached), not the directory
> blocks themselves. As such, why does this need to be done in the
> kernel?  This can all be done in userspace, and even hidden within
> the readdir() or ftw()/nftw() implementations themselves so it's OS,
> kernel and filesystem independent..
> 

I don't see how the sorting of the inode reads in disk block order can be
accomplished in userland without knowing the fs-specific topology. From my
observations, I've seen that the performance gain is the most when we can
order the reads such that seek times are minimized on rotational media.

I have not tested my patches against SSDs, but my guess would be that the
performance impact would be minimal, if any.

Cheers!
--Abhi


Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-25 Thread Dave Chinner
On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> > Hi all,
> > 
> > The topic of a readdirplus-like syscall had come up for discussion at last 
> > year's
> > LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 
> > implementations
> > to get at a directory's entries as well as stat() info on the individual 
> > inodes.
> > I'm presenting these patches and some early test results on a single-node 
> > GFS2
> > filesystem.
> > 
> > 1. dirreadahead() - This patchset is very simple compared to the 
> > xgetdents() system
> > call below and scales very well for large directories in GFS2. 
> > dirreadahead() is
> > designed to be called prior to getdents+stat operations.
> 
> Hmm.  Have you tried plumbing these read-ahead calls in under the normal
> getdents() syscalls?

The issue is not directory block readahead (which some filesystems
like XFS already have), but issuing inode readahead during the
getdents() syscall.

It's the semi-random, interleaved inode IO that is being optimised
here (i.e. queued, ordered, issued, cached), not the directory
blocks themselves. As such, why does this need to be done in the
kernel?  This can all be done in userspace, and even hidden within
the readdir() or ftw()/nftw() implementations themselves so it's OS,
kernel and filesystem independent..

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Cluster-devel] [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-25 Thread Trond Myklebust
On Fri, Jul 25, 2014 at 4:02 PM, Steven Whitehouse  wrote:
> Hi,
>
>
> On 25/07/14 19:28, Zach Brown wrote:
>>
>
>> How do the file systems that implement directory read-ahead today deal
>> with this?
>
> I don't know of one that does - or at least readahead of the directory info
> itself is one thing (which is relatively easy, and done by many file
> systems) its reading ahead the inodes within the directory which is more
> complex, and what we are talking about here.
>

NFS looks at whether or not there are lookup revalidations and/or
getattr calls in between the calls to readdir(). If there are, then we
assume an 'ls -l' workload, and continue to issue readdirplus calls to
the server.

Note that we also actively zap the readdir cache if we see getattr
calls over the wire, since the single call to readdirplus is usually
very much more efficient.

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.mykleb...@primarydata.com


Re: [Cluster-devel] [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-25 Thread Steven Whitehouse

Hi,

On 25/07/14 19:28, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 07:08:12PM +0100, Steven Whitehouse wrote:
>> Hi,
>>
>> On 25/07/14 18:52, Zach Brown wrote:
>>> [snip]
>>>
>>> Hmm.  Have you tried plumbing these read-ahead calls in under the normal
>>> getdents() syscalls?
>>>
>>> We don't have a filereadahead() syscall and yet we somehow manage to
>>> implement buffered file data read-ahead :).
>>>
>>> - z
>>
>> Well I'm not sure thats entirely true... we have readahead() and we also
>> have fadvise(FADV_WILLNEED) for that.
>
> Sure, fair enough.  It would have been more precise to say that buffered
> file data readers see read-ahead without *having* to use a syscall.
>
>> doubt, but how would we tell getdents64() when we were going to read the
>> inodes, rather than just the file names?
>
> How does transparent file read-ahead know how far to read-ahead, if at
> all?

In the file readahead case it has some context, and that's stored in the
struct file. That's where the problem lies in this case: the struct file
relates to the directory, and when we then call open, or stat, or whatever
on some file within that directory, we don't pass the directory's fd to
that open call, so we don't have a context to use. We could possibly look
through the open fds relating to the process that called open, to see if
the parent dir of the inode we are opening is in there, in order to find
the context to figure out whether to do readahead or not, but... it's not
very nice to say the least.

I'm very much in agreement that doing this automatically is best, but that
only works when it's possible to get a very good estimate of whether the
readahead is needed or not. That is much easier for file data than it is
for inodes in a directory. If someone can figure out how to get around
this problem though, then that is certainly something we'd like to look
at.

The problem gets even more tricky in case the user only wants, say, half
of the inodes in the directory... how does the kernel know which half?

The idea here is really to give some idea of the kind of performance gains
that we might see with the readahead vs xgetdents approaches, and, by the
sizes of the patches, the relative complexity of the implementations.

I think overall, the readahead approach is the more flexible... if I had a
directory full of files I wanted to truncate, for example, it would be
possible to use the same readahead to pull in the inodes quickly and then
issue the truncates to the pre-cached inodes. That is something that would
not be possible using xgetdents. Whether that's useful for real world
applications or not remains to be seen, but it does show that it can
handle more potential use cases than xgetdents. Also the ability to
readahead only an application-specific subset of inodes is a useful
feature.

There is certainly a discussion to be had about how to specify the inodes
that are wanted. Using the directory position is a relatively easy way to
do it, and works well when most of the inodes in a directory are wanted.
Specifying the file names would work better when fewer inodes are wanted,
but then, if very few are required, is readahead likely to give much of a
gain anyway? ...so that's why we chose the approach that we did.

> How do the file systems that implement directory read-ahead today deal
> with this?

I don't know of one that does - or at least, readahead of the directory
info itself is one thing (which is relatively easy, and done by many file
systems); it's reading ahead the inodes within the directory which is more
complex, and what we are talking about here.

> Just playing devil's advocate here:  It's not at all obvious that adding
> more interfaces is necessary to get directory read-ahead working, given
> our existing read-ahead implementations.
>
> - z

That's perfectly ok - we hoped to generate some discussion and they are
good questions,


Steve.



Re: [Cluster-devel] [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-25 Thread Steven Whitehouse

Hi,

On 25/07/14 18:52, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
>> Hi all,
>>
>> The topic of a readdirplus-like syscall had come up for discussion at
>> last year's LSF/MM collab summit. I wrote a couple of syscalls with
>> their GFS2 implementations to get at a directory's entries as well as
>> stat() info on the individual inodes. I'm presenting these patches and
>> some early test results on a single-node GFS2 filesystem.
>>
>> 1. dirreadahead() - This patchset is very simple compared to the
>> xgetdents() system call below and scales very well for large
>> directories in GFS2. dirreadahead() is designed to be called prior to
>> getdents+stat operations.
>
> Hmm.  Have you tried plumbing these read-ahead calls in under the normal
> getdents() syscalls?
>
> We don't have a filereadahead() syscall and yet we somehow manage to
> implement buffered file data read-ahead :).
>
> - z

Well I'm not sure that's entirely true... we have readahead() and we also
have fadvise(FADV_WILLNEED) for that. It could be added to getdents() no
doubt, but how would we tell getdents64() when we were going to read the
inodes, rather than just the file names? We may only want to readahead
some subset of the directory entries rather than all of them, so the
thought was to allow that flexibility by making it its own syscall.
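
For file data, the explicit variants look roughly like this minimal sketch
(error handling omitted; the 1MB readahead window is an arbitrary choice
for illustration):

#define _GNU_SOURCE             /* for readahead() */
#include <fcntl.h>
#include <unistd.h>

static void prefetch_file_data(const char *path)
{
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return;

        /* hint that this file's data will be wanted soon, so the kernel
         * can start pulling it into the page cache */
        readahead(fd, 0, 1024 * 1024);                  /* first 1MB */
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);   /* len 0 == to EOF */

        close(fd);
}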


Steve.



Re: [Cluster-devel] [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-25 Thread Zach Brown
On Fri, Jul 25, 2014 at 07:08:12PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On 25/07/14 18:52, Zach Brown wrote:
> >On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> >>Hi all,
> >>
> >>The topic of a readdirplus-like syscall had come up for discussion at last 
> >>year's
> >>LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 
> >>implementations
> >>to get at a directory's entries as well as stat() info on the individual 
> >>inodes.
> >>I'm presenting these patches and some early test results on a single-node 
> >>GFS2
> >>filesystem.
> >>
> >>1. dirreadahead() - This patchset is very simple compared to the 
> >>xgetdents() system
> >>call below and scales very well for large directories in GFS2. 
> >>dirreadahead() is
> >>designed to be called prior to getdents+stat operations.
> >Hmm.  Have you tried plumbing these read-ahead calls in under the normal
> >getdents() syscalls?
> >
> >We don't have a filereadahead() syscall and yet we somehow manage to
> >implement buffered file data read-ahead :).
> >
> >- z
> >
> Well I'm not sure thats entirely true... we have readahead() and we also
> have fadvise(FADV_WILLNEED) for that.

Sure, fair enough.  It would have been more precise to say that buffered
file data readers see read-ahead without *having* to use a syscall.

> doubt, but how would we tell getdents64() when we were going to read the
> inodes, rather than just the file names?

How does transparent file read-ahead know how far to read-ahead, if at
all?

How do the file systems that implement directory read-ahead today deal
with this?

Just playing devil's advocate here:  It's not at all obvious that adding
more interfaces is necessary to get directory read-ahead working, given
our existing read-ahead implementations.

- z


Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-25 Thread Zach Brown
On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> Hi all,
> 
> The topic of a readdirplus-like syscall had come up for discussion at last 
> year's
> LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 
> implementations
> to get at a directory's entries as well as stat() info on the individual 
> inodes.
> I'm presenting these patches and some early test results on a single-node GFS2
> filesystem.
> 
> 1. dirreadahead() - This patchset is very simple compared to the xgetdents() 
> system
> call below and scales very well for large directories in GFS2. dirreadahead() 
> is
> designed to be called prior to getdents+stat operations.

Hmm.  Have you tried plumbing these read-ahead calls in under the normal
getdents() syscalls?

We don't have a filereadahead() syscall and yet we somehow manage to
implement buffered file data read-ahead :).

- z


[RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

2014-07-25 Thread Abhijith Das
Hi all,

The topic of a readdirplus-like syscall had come up for discussion at last
year's LSF/MM collab summit. I wrote a couple of syscalls with their GFS2
implementations to get at a directory's entries as well as stat() info on
the individual inodes. I'm presenting these patches and some early test
results on a single-node GFS2 filesystem.

1. dirreadahead() - This patchset is very simple compared to the
xgetdents() system call below and scales very well for large directories
in GFS2. dirreadahead() is designed to be called prior to getdents+stat
operations. In its current form, it only speeds up stat() operations by
caching the relevant inodes. Support can be added in the future to cache
extended attribute blocks as well. This works by first collecting all the
inode numbers of the directory's entries (subject to a numeric or memory
cap). This list is sorted by inode disk block order and passed on to
workqueues to perform lookups on the inodes asynchronously to bring them
into the cache.
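
For illustration, the intended calling pattern is roughly the sketch below;
the dirreadahead() prototype and the max_entries argument here are just
placeholders for this discussion, not the exact interface from the patches:

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* placeholder prototype for the proposed syscall - illustration only */
extern int dirreadahead(int dirfd, unsigned int max_entries);

static void list_long(const char *dirpath)
{
        DIR *d = opendir(dirpath);
        struct dirent *de;
        struct stat st;
        char path[4096];

        if (!d)
                return;

        /* 1. kick off inode readahead for (up to) the whole directory */
        dirreadahead(dirfd(d), 10000);

        /* 2. the usual getdents+stat loop now mostly hits cached inodes */
        while ((de = readdir(d)) != NULL) {
                snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
                stat(path, &st);
        }
        closedir(d);
}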

2. xgetdents() - I had posted a version of this patchset some time last
year and it is largely unchanged - I just ported it to the latest upstream
kernel. It allows the user to request a combination of entries, stat and
xattrs (keys/values) for a directory. The stat portion is based on David
Howells' xstat patchset he had posted last year as well. I've included the
relevant vfs bits in my patchset. xgetdents() in GFS2 works in two phases.
In the first phase, it collects all the dirents by reading the directory
in question. In phase two, it reads in inode blocks and xattr blocks (if
requested) for each entry after sorting the disk accesses in block order.
All of the intermediate data is stored in a buffer backed by a vector of
pages and is eventually transferred to the user-supplied buffer.

Both syscalls perform significantly better than a simple getdents+stat
with a cold cache. The main advantage lies in being able to sort disk
accesses for a bunch of inodes in advance, compared to seeking all over
the disk for inodes one entry at a time.

This graph (https://www.dropbox.com/s/fwi1ovu7mzlrwuq/speed-graph.png) shows
the time taken to get directory entries and their respective stat info by 3
different sets of syscalls:

1) getdents+stat ('ls -l', basically) - Solid blue line
2) xgetdents with various buffer size and num_entries limits - Dotted lines
   Eg: v16384 d1 means a limit of 16384 pages for the scratch buffer and
   a maximum of 1 entries to collect at a time.
3) dirreadahead+getdents+stat with various num_entries limits - Dash-dot lines
   Eg: d1 implies that it would fire off a max of 1 inode lookups during
   each syscall.

numfiles:                       1      5     10     50

getdents+stat                1.4s   220s   514s  2441s
xgetdents                    1.2s    43s    75s  1710s
dirreadahead+getdents+stat   1.1s     5s    68s   391s

Here is a seekwatcher graph from a test run on a directory of 5 files
(https://www.dropbox.com/s/fma8d4jzh7365lh/5-combined.png). The comparison
is between getdents+stat and xgetdents. The first set of plots is of
getdents+stat, followed by xgetdents() with steadily increasing buffer
sizes (256 to 262144) and num_entries (100 to 100) limits. One can see the
effect of ordering the disk reads in the Disk IO portion of the graphs and
the corresponding effect on seeks, throughput and overall time taken.

This second seekwatcher graph similarly shows the
dirreadahead()+getdents()+stat() syscall combo for a 50-file directory
with increasing num_entries (100 to 100) limits versus getdents+stat
(https://www.dropbox.com/s/rrhvamu99th3eae/50-ra_combined_new.png). The
corresponding getdents+stat baseline for this run is at the top of the
series of graphs.

I'm posting these two patchsets shortly for comments.

Cheers!
--Abhi

Red Hat Filesystems