On Tue, May 14, 2013 at 12:14 PM, Ben Hoyt wrote:
> I don't think that's a big issue, however. If it's 3-8x faster in the
> majority of cases (local disk on all systems, Windows networking), and
> no slower in a minority (sshfs), I'm not too sad about that.
Might be interesting to test something
Very interesting. Although os.walk may not be widely used in cluster
applications, anything that lowers the number of calls to stat() in an
application is worthwhile for parallel filesystems as stat() is handled by
the only non-parallel node, the MDS.
Small test on another NFS drive:
Creating tree
> I wonder how sshfs compared to nfs.
(I've modified your benchmark to also test the case where data isn't
in the page cache).
Local ext3:
cached:
os.walk took 0.096s, scandir.walk took 0.030s -- 3.2x as fast
uncached:
os.walk took 0.320s, scandir.walk took 0.130s -- 2.5x as fast
NFSv3, 1Gb/s ne
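For anyone who wants a rough feel for the kind of comparison quoted above without the full benchmark.py, here is a minimal single-directory sketch. It is not the benchmark from the thread: the function names are hypothetical, and it uses the os.scandir() that ships with Python 3.6 and later rather than the scandir module under discussion. It counts subdirectories once with a per-name os.lstat() and once from the directory scan, and times each.

```python
import os
import stat
import time


def count_dirs_listdir(path):
    """Count subdirectories with one os.lstat() call per entry."""
    return sum(
        stat.S_ISDIR(os.lstat(os.path.join(path, name)).st_mode)
        for name in os.listdir(path)
    )


def count_dirs_scandir(path):
    """Count subdirectories using the type info from the directory scan."""
    with os.scandir(path) as entries:
        return sum(entry.is_dir(follow_symlinks=False) for entry in entries)


def timed(func, path):
    """Return (result, elapsed_seconds) for a single call of func(path)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for func in (count_dirs_listdir, count_dirs_scandir):
        count, elapsed = timed(func, os.curdir)
        print(f"{func.__name__}: {count} dirs in {elapsed:.6f}s")
```

Timings from a single directory are noisy and depend heavily on the page cache, so treat any ratio from this sketch as indicative only.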
On Tue, 14 May 2013 22:14:42 +1200,
Ben Hoyt wrote:
> >> It should be no slower when it's all moved to C.
> >
> > The slowdown is too small to be interesting. The main point is that
> > there was no speedup, though.
>
> True, and thanks for testing.
>
> I don't think that's a big issue, howev
>> It should be no slower when it's all moved to C.
>
> The slowdown is too small to be interesting. The main point is that
> there was no speedup, though.
True, and thanks for testing.
I don't think that's a big issue, however. If it's 3-8x faster in the
majority of cases (local disk on all syst
On Tue, 14 May 2013 21:10:08 +1200,
Ben Hoyt wrote:
> > On a locally running VM:
> > os.walk took 0.400s, scandir.walk took 0.120s -- 3.3x as fast
> >
> > Same VM accessed from the host through a local sshfs:
> > os.walk took 2.261s, scandir.walk took 2.055s -- 1.1x as fast
> >
> > Same, but wi
> On a locally running VM:
> os.walk took 0.400s, scandir.walk took 0.120s -- 3.3x as fast
>
> Same VM accessed from the host through a local sshfs:
> os.walk took 2.261s, scandir.walk took 2.055s -- 1.1x as fast
>
> Same, but with "sshfs -o cache=no":
> os.walk took 24.060s, scandir.walk took 25.9
>> large to be more "real world". I've just tested it, and in practice
>> file system doesn't make much difference, so I've fixed that now:
>
> Thanks. I had bumped the number of files, thinking it would make things
> more interesting, and it filled my disk.
Denial of Pitrou attack -- sorry! :-) A
On Tue, 14 May 2013 20:54:50 +1200,
Ben Hoyt wrote:
> >> If anyone can run benchmark.py on Linux / NFS or similar, that'd be
> >> great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first
> >> and then move the "benchtree" to the network file system to run it
> >> against that.
> >
>
>> If anyone can run benchmark.py on Linux / NFS or similar, that'd be
>> great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first
>> and then move the "benchtree" to the network file system to run it
>> against that.
>
> Why does your benchmark create such large files? It doesn't make s
On Tue, 14 May 2013 10:41:01 +1200,
Ben Hoyt wrote:
>
> If anyone can run benchmark.py on Linux / NFS or similar, that'd be
> great. You'll probably have to lower DEPTH/NUM_DIRS/NUM_FILES first
> and then move the "benchtree" to the network file system to run it
> against that.
On a locally r
On Tue, 14 May 2013 10:41:01 +1200,
Ben Hoyt wrote:
> > I'd like to see the numbers for NFS or CIFS - stat() can be brutally slow
> > over a network connection (that's why we added a caching mechanism
> > to importlib).
>
> How do I know what file system Windows networking is using? In any
> case,
On Sun, May 12, 2013 at 3:04 PM, Ben Hoyt wrote:
> > And if we're creating a custom object instead, why return a 2-tuple
> > rather than making the entry's name an attribute of the custom object?
> >
> > To me, that suggests a more reasonable API for os.scandir() might be
> > for it to be an iter
> OK, you got me! I'm now convinced that a property is a bad idea.
Thanks. :-)
> I still like to annotate that the function may return a cached value.
> Perhaps lstat() could require an argument?
>
> def lstat(self, cached):
>     if not cached or self._lstat is None:
>         self._
> I'd like to see the numbers for NFS or CIFS - stat() can be brutally slow
> over a network connection (that's why we added a caching mechanism to
> importlib).
How do I know what file system Windows networking is using? In any
case, here's some numbers on Windows -- it's looking pretty good! This
is
On 13.05.2013 02:21, Ben Hoyt wrote:
> Are you suggesting just accessing .cached_lstat could call os.lstat()?
> That seems very bad to me. It's a property access -- it looks cheap,
> therefore people will expect it to be. From PEP 8 "Avoid using
> properties for computationally expensive operatio
On Mon, May 13, 2013 at 10:25 PM, Ben Hoyt wrote:
> Okay, I've renamed my "BetterWalk" module to "scandir" and updated it
> as per our discussion:
>
> https://github.com/benhoyt/scandir/#readme
Nice!
> PERFORMANCE: On Windows I'm seeing that scandir.walk() on a large test
> tree (see benchmark.p
Hi Ben,
On 13.05.13 14:25, Ben Hoyt wrote:
...It's not yet production-ready, and is basically still in API and
performance testing stage. ...
In any case, I really like the API (thanks mostly to Nick Coghlan),
and performance is great, even with DirEntry being written in Python.
PERFORMANCE:
Okay, I've renamed my "BetterWalk" module to "scandir" and updated it
as per our discussion:
https://github.com/benhoyt/scandir/#readme
It's not yet production-ready, and is basically still in API and
performance testing stage. For instance, the underlying scandir_helper
functions don't even retu
On Mon, May 13, 2013 at 12:11 PM, Victor Stinner
wrote:
> 2013/5/13 Ben Hoyt:
>> class DirEntry:
>>     ...
>>     def lstat(self):
>>         if self._lstat is None:
>>             self._lstat = os.lstat(os.path.join(self._path, self.name))
>>         return self._lstat
>>     ...
>
> You need to provid
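The caching behaviour quoted above can be tried out in isolation. What follows is only a sketch: the constructor signature here is a simplified assumption (the quoted one also takes a dirent argument), and the class is a stand-in for experimentation, not the scandir module's actual DirEntry.

```python
import os


class DirEntry:
    """Hypothetical stand-in for the DirEntry discussed in the thread."""

    def __init__(self, name, lstat=None, path="."):
        self.name = name
        self._lstat = lstat  # may be pre-filled by the directory scan
        self._path = path

    def lstat(self):
        # Call os.lstat() only on first use; later calls reuse the result.
        if self._lstat is None:
            self._lstat = os.lstat(os.path.join(self._path, self.name))
        return self._lstat
```

Because the result is stored on first use, a second entry.lstat() call returns the same object without touching the file system. That is the staleness trade-off the thread is debating: the cached value never refreshes on its own.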
> I would prefer to go the other route and not expose lstat(). It's
> cleaner and less confusing to have a property cached_lstat on the object
> because it actually says what it contains. The property's internal code
> can do a lstat() call if necessary.
Are you suggesting just accessing .cached
2013/5/13 Ben Hoyt:
> class DirEntry:
>     def __init__(self, name, dirent, lstat, path='.'):
>         # User shouldn't need to call this, but called internally by scandir()
>         self.name = name
>         self.dirent = dirent
>         self._lstat = lstat  # non-public attributes
>
On 13.05.2013 00:04, Ben Hoyt wrote:
> In fact, I don't think .cached_lstat should be exposed to the user.
> They just call entry.lstat(), and it returns a cached stat or calls
> os.lstat() to get the real stat if required (and populates the
> internal cached stat value). And the entry.is* functi
> And if we're creating a custom object instead, why return a 2-tuple
> rather than making the entry's name an attribute of the custom object?
>
> To me, that suggests a more reasonable API for os.scandir() might be
> for it to be an iterator over "dir_entry" objects:
>
> name (as a string)
>
On Sun, May 12, 2013 at 2:30 AM, Nick Coghlan wrote:
> Once that core functionality is in place, *then* start debating what
> other use cases to optimise based on which platforms would support
> those optimisations and which would require dropping back to the full
> stat implementation anyway.
Al
On Sun, May 12, 2013 at 1:42 AM, Christian Heimes wrote:
> I suggest that we call it .lstat() and .cached_lstat to make clear that
> we are talking about no-follow stat() here.
Fair point.
> On platforms that support
> fstatat() it should use fstatat(dir_fd, name, &buf, AT_SYMLINK_NOFOLLOW)
> wh
On 11.05.2013 16:34, Nick Coghlan wrote:
> Here's the full set of fields on a current stat object:
>
> st_atime
> st_atime_ns
> st_blksize
> st_blocks
> st_ctime
> st_ctime_ns
> st_dev
> st_gid
> st_ino
> st_mode
> st_mtime
> st_mtime_ns
> st_nlink
> st_rdev
> st_size
> st_uid
And there are mor
On Sat, May 11, 2013 at 2:24 PM, Ben Hoyt wrote:
> In all the *practical* examples I've seen (and written myself), I
> iterate over a directory and I just need to know whether it's a file
> or directory (or maybe a link). Occasionally you need the size as
> well, but that would just mean a simila
> Have you actually tried the code? It can't give you correct answers. The
> struct dirent.d_type member as returned by readdir() has different
> values than stat.st_mode's file type.
Yes, I'm quite aware of that. In the first version of BetterWalk
that's exactly how it did it, and this approach w
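The mismatch between d_type and st_mode mentioned in the quoted objection can be made concrete. The sketch below assumes the Linux/glibc numbering of the DT_* constants (other platforms may number them differently or not provide d_type at all), and the helper name d_type_to_mode is hypothetical. It mirrors what glibc's DTTOIF() macro does: shift the d_type value left by 12 bits to get the file-type bits of st_mode.

```python
import stat

# Linux/glibc values for struct dirent.d_type (an assumption of this sketch;
# other platforms may use different values).
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8
DT_LNK = 10


def d_type_to_mode(d_type):
    """Return the st_mode file-type bits for a d_type, or None if unknown."""
    if d_type == DT_UNKNOWN:
        return None  # the caller has to fall back to os.lstat()
    # Same arithmetic as glibc's DTTOIF(): type bits are d_type << 12.
    return d_type << 12


if __name__ == "__main__":
    mode = d_type_to_mode(DT_DIR)
    print(stat.S_ISDIR(mode), mode == stat.S_IFDIR)
```

The DT_UNKNOWN case is the important one: some file systems never fill in d_type, so any code built on it needs a stat fallback.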
> In the python-ideas list there's a thread "PEP: Extended stat_result"
> about adding methods to stat_result.
>
> Using that, you wouldn't necessarily have to look at st.st_mode. The method
> could perform an additional os.stat() if the field was None. For
>
> example:
>
> # Build lists of files a
On Fri, 10 May 2013 23:53:37 +1000,
Nick Coghlan wrote:
> On Fri, May 10, 2013 at 11:46 PM, Christian Heimes
> wrote:
> > On 10.05.2013 14:16, Antoine Pitrou wrote:
> >> But what if some systems return more than the file type and less
> >> than a full stat result? The general problem is POSI
On 10 May, 2013, at 16:30, MRAB wrote:
>>
> [snip]
> In the python-ideas list there's a thread "PEP: Extended stat_result"
> about adding methods to stat_result.
>
> Using that, you wouldn't necessarily have to look at st.st_mode. The method
> could perform an additional os.stat() if the field
On 10/05/2013 11:55, Ben Hoyt wrote:
A few of us were having a discussion at
http://bugs.python.org/issue11406 about adding os.scandir(): a
generator version of os.listdir() to make iterating over very large
directories more memory efficient. This also reflects how the OS gives
things to you -- i
On 10 May, 2013, at 15:54, Antoine Pitrou wrote:
> On Fri, 10 May 2013 15:46:21 +0200,
> Christian Heimes wrote:
>
>> On 10.05.2013 14:16, Antoine Pitrou wrote:
>>> But what if some systems return more than the file type and less
>>> than a full stat result? The general problem is POSIX's
On Fri, May 10, 2013 at 11:46 PM, Christian Heimes wrote:
> On 10.05.2013 14:16, Antoine Pitrou wrote:
>> But what if some systems return more than the file type and less than a
>> full stat result? The general problem is POSIX's terrible inertia.
>> I feel that a stat result with some None fiel
On Fri, 10 May 2013 15:46:21 +0200,
Christian Heimes wrote:
> On 10.05.2013 14:16, Antoine Pitrou wrote:
> > But what if some systems return more than the file type and less
> > than a full stat result? The general problem is POSIX's terrible
> > inertia. I feel that a stat result with some
On 10.05.2013 14:16, Antoine Pitrou wrote:
> But what if some systems return more than the file type and less than a
> full stat result? The general problem is POSIX's terrible inertia.
> I feel that a stat result with some None fields would be an acceptable
> compromise here.
POSIX only defines
On 10 May, 2013, at 14:16, Antoine Pitrou wrote:
> On Fri, 10 May 2013 13:46:30 +0200,
> Christian Heimes wrote:
>>
>> Hence I'm +1 on the general idea but -1 on something stat like. IMHO
>> os.scandir() should yield four objects:
>>
>> * name
>> * inode
>> * file type or DT_UNKNOWN
>> * s
On Fri, 10 May 2013 13:46:30 +0200,
Christian Heimes wrote:
>
> Hence I'm +1 on the general idea but -1 on something stat like. IMHO
> os.scandir() should yield four objects:
>
> * name
> * inode
> * file type or DT_UNKNOWN
> * stat_result or None
>
> stat_result shall only be returned w
On 10.05.2013 12:55, Ben Hoyt wrote:
> Higher-level functions like os.walk() would then check the fields they
> needed are not None, and only call os.stat() if needed, for example:
>
> # Build lists of files and directories in path
> files = []
> dirs = []
> for name, st in os.scandir(path):
>
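The loop quoted above is cut off, and it uses the (name, st) tuple form that was only a proposal at this point in the thread. As a rough illustration of the same idea, here is a sketch written against the os.scandir() that ships with Python 3.6 and later, whose entries are objects rather than tuples. The function name split_entries is hypothetical.

```python
import os


def split_entries(path):
    """Return (dirs, files) name lists for one directory level."""
    dirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            # is_dir() normally answers from the directory scan itself and
            # only needs a stat call when the type was not provided.
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    return dirs, files


if __name__ == "__main__":
    dirs, files = split_entries(os.curdir)
    print(len(dirs), "directories,", len(files), "other entries")
```

With follow_symlinks=False a symlink to a directory lands in the second list, which matches the no-follow (lstat-style) semantics discussed elsewhere in the thread.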
A few of us were having a discussion at
http://bugs.python.org/issue11406 about adding os.scandir(): a
generator version of os.listdir() to make iterating over very large
directories more memory efficient. This also reflects how the OS gives
things to you -- it doesn't give you a big list, but you