Re: readdir/telldir/seekdir problem (i think)

2015-04-30 Thread Julian Elischer

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as advertised..


I have a patch for review at https://reviews.freebsd.org/D2410
which fixes this special case needed for samba, and improves
the behaviour generally, however the real fix is to utilise cookies
throughout the VFS and library so that this workaround is not needed.

The patch, when installed, allows Samba's torture tests to succeed on 
FreeBSD 8 through 11


here's a little test program

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define CHUNKSIZE 5
#define TOTALFILES 40

static void
SeekDir(DIR *dirp, long loc)
{
printf("Seeking back to location %ld\n", loc);
seekdir(dirp, loc);
}

static long
TellDir(DIR *dirp)
{
long loc;

loc = telldir(dirp);
printf("telldir assigned location %ld\n", loc);
return (loc);
}

int
main(int argc, char *argv[])
{
DIR*dirp;
inti;
intj;
longoffset = 0, prev_offset = 0;
char   *files[100];
charfilename[100];
intfd;
struct dirent  *dp = NULL;

if (chdir("./test2") != 0) {
err(EX_OSERR, "chdir");
}

/*/
/* Put a set of sample files in the target directory */
/*/

for (i=1; i < TOTALFILES ; i++)
{
sprintf(filename, "file-%d", i);
fd = open(filename, O_CREAT, 0666);
if (fd == -1) {
err(EX_OSERR, "open");
}
close(fd);
}
dirp = opendir(".");
offset = TellDir(dirp);
for (i = 0; i < 20; i++)
files[i] = malloc(20);

/***/
/* enumerate and delete small sets of files, one group */
/* at a time.  */
/***/
do {

/*/
/* Read in up to CHUNKSIZE file names*/
/* i will be the number of files we hold */
/*/
for (i = 0; i < CHUNKSIZE; i++) {
if ((dp = readdir(dirp)) != NULL) {
strcpy(files[i], dp->d_name);

printf("readdir (%ld) returned file %s\n",
offset, files[i]);

prev_offset = offset;
offset = TellDir(dirp);

} else {
printf("readdir returned null\n");
break;
}
}


//
/* Simuate the last entry not fitting into our (samba's) 
buffer */
/* If we read someting in on the last slot, push it 
back*/
/* Pretend it didn't fit. This is approximately what SAMBA 
does.*/

//
if (dp != NULL) {
/* Step back */
SeekDir(dirp, prev_offset);
offset = TellDir(dirp);
i--;
printf("file %s returned\n", files[i]);
}

/*/
/* i is the number of names we have left.*/
/*  Delete them. */
/*/
for (j = 0; j < i; j++) {
if (*files[j] == '.') {
printf ("skipping %s\n", files[j]);
} else {
printf("Unlinking file %s\n", files[j]);
if (unlink(files[j]) != 0) {
err(EX_OSERR, "unlink");
}
}
}
} while (dp != NULL);

closedir(dirp);
//chdir("..");

}

The program is simulating what Samba does when fails. (doing a 
recursive delete of a directory)
What it does is reads a chunk of names using readdir() until it's 
(small) buffer s full,
then it uses seekdir() to seek back before the last entry it read, 
(which fails to fit),

 theortically leaving it for the next read.
It then deletes the entries it found and repeats the cycle.

Eventually it should have found all the files in the directory and 
deleted them.

Except that it doesn't.

What actually happens is that some files are not enumerated, even 
though

the seekdir() should have made the readdir() find them.
for added fun. the FIRST seekdir appears to work. but all subsequent 
ones don't.


It behaves this way in -current , all the way back to at least 8.0.

if there's a bug in my program please let me know, but samba has the 
same problem.. e.g. on freeNAS.


to use the program make a directory called "./test2" and then run it 
in the current directory..
It fills it with files and then tried to  (fails) delete them in 
small batches.


here's some (annotated) output:

./testit
t

Re: readdir/telldir/seekdir problem (i think)

2015-04-30 Thread Rick Macklem
John Baldwin wrote:
> On Thursday, April 30, 2015 02:56:25 PM Julian Elischer wrote:
> > We really need to do something because the current system is really
> > broken.
> > And the fact that dirent has  *32 bit* inode number in it was a
> > shock.. I'd presumed
> > that had gone the way of the dinosaurs and dodo.
> > I think 11 needs to have a new dirent structure given out by a new
> > syscall.
> > (old one still present for compat reasons). Whether we need a
> > readdir64() and friends
> > I have not yet decided. Maybe it's time to bump libc's number again
> > :-)
> 
> This is the entire point of the ino64 branch (and project): to
> rototill
> struct stat and related structures so we have one ABI jump instead of
> lots of separate ABI jumps.  It bumps ino_t to 64 bits, dev_t to 32
> (IIRC),
> adds d_off to dirent, etc.  I believe the branch is able to do it all
> with
> symbol versioning rather than bumping libc.  However, this is why
> several of
> us keep harping on this as the real long-term solution. :)
> 
Yep. Btw, some of the stuff I did for the 64bit d_fileno patch I have
isn't in projects/ino64 because they felt it could go in after the main
patch. I haven't looked at how the new getdirentries() syscall + opendir()
and friends actually need to change to use d_off/d_cookie yet, but will
do so. (Btw, some like d_cookie since they feel it is more representative
of what it is while others prefer d_off because that is what others used.
I, personally, couldn't care less what it ends up being called.)

I will admit to having some concern w.r.t. progress on this. Last I
heard Gleb Kurtsou didn't have time to work on it and he's the main
author (from 2011, if I recall correctly).

I agree we really need to figure out how to get this in FreeBSD11.

rick
ps: With a naming change of dirent->dirent64, my patch could go into
head before projects/ino64, since all that happens is the the
high order 32bits of d_fileno == 0 until projects/ino64 goes in.
If others think it appropriate, I can do that.

> --
> John Baldwin
> ___
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscr...@freebsd.org"
> 
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-30 Thread John Baldwin
On Thursday, April 30, 2015 02:56:25 PM Julian Elischer wrote:
> We really need to do something because the current system is really 
> broken.
> And the fact that dirent has  *32 bit* inode number in it was a 
> shock.. I'd presumed
> that had gone the way of the dinosaurs and dodo.
> I think 11 needs to have a new dirent structure given out by a new 
> syscall.
> (old one still present for compat reasons). Whether we need a 
> readdir64() and friends
> I have not yet decided. Maybe it's time to bump libc's number again :-)

This is the entire point of the ino64 branch (and project): to rototill
struct stat and related structures so we have one ABI jump instead of
lots of separate ABI jumps.  It bumps ino_t to 64 bits, dev_t to 32 (IIRC),
adds d_off to dirent, etc.  I believe the branch is able to do it all with
symbol versioning rather than bumping libc.  However, this is why several of
us keep harping on this as the real long-term solution. :)

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-29 Thread Julian Elischer

On 4/25/15 5:52 AM, Jilles Tjoelker wrote:

On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:

Yes, this isn't at all safe.  There's no guarantee whatsoever that
the offset on the directory fd that isn't something returned by
getdirentries has any meaning.  In particular, the size of the
directory entry in a random filesystem might be a different size
than the structure returned by getdirentries (since it converts
things into a FS-independent format).
This might work for UFS by accident, but this is probably why ZFS
doesn't work.

Turns out they are all broken in the face of deletes.

I managed to make a patch that fixes things in the simple case of "backing
up by one (and only one time) record", which is what samba needs to do.

Samba reads until the latest entry won't fit (buffer is full)  and then
effectively pushes the last one back by doing a seekdir back one record.
It then sends the full buffer back to the client and continus reading 
at the

next record, which is the one it pushed back.

It turs out that because the buffer holding the last dirent presented 
is always
still resident in memory, it is ALWAYS possible to slip the 'current 
entry' pointer

back ONE entry, (but no more).

This unbreaks samba, even in the case of deletes. (at least with UFS 
and ZFS).


I'll make the patch available when I've tested it more and cleaned it up.




However, this might be properly fixed by the thing that ino64 is
doing where each directory entry returned by getdirentries gives
you a seek offset that you _can_ directly seek to (as opposed to
seeking to the start of the block and then walking forward N
entries until you get an inter-block entry that is the same).

The ino64 branch only reserves space for d_off and does not use it in
any way. This is appropriate since actually using d_off is a major
feature addition.

d_cookie please.. :-)


A proper d_off would still be useful even if UFS's readdir keeps masking
off the offset so a directory read always starts at the beginning of a
512-byte directory block, since this allows more distinct offset values
than safely using getdirentries()'s *basep. With d_off, one outer loop
must read at least one directory block to avoid spinning indefinitely,
while using getdirentries()'s *basep requires reading the whole
getdirentries() buffer.

not sure I follow what you mean by that.


Some Linux filesystems go further and provide a unique d_off for each
entry.


yes ZFS provides that.
we currently don't export it beyond teh VFS, but there is code that is 
supposed to
supply  a cookie for each entry. It could be put into the dirent 
structure should
we want to do that. Currently it is put into a separate memory array 
in parallel
to the dirents, except that we give it a NULL pointer, which means "we 
don't want

cookies thanks" .

We really need to do something because the current system is really 
broken.
And the fact that dirent has  *32 bit* inode number in it was a 
shock.. I'd presumed

that had gone the way of the dinosaurs and dodo.
I think 11 needs to have a new dirent structure given out by a new 
syscall.
(old one still present for compat reasons). Whether we need a 
readdir64() and friends

I have not yet decided. Maybe it's time to bump libc's number again :-)





Another idea would be to store the last d_ino instead of dd_loc into the
struct ddloc. On seekdir(), this would seek to loc_seek as before and
skip entries until that d_ino is found, or to the start of the buffer if
not found (and possibly return some entries again that should not be
returned, but Samba copes with that).


I didn't see it coping with that..  an earlier version of my patch 
suffered from duplicate entries in samba.
it is also dangerous because multiple files linked to the same inode 
would get confused.


it may be possible however to link the telldir cookies to a HASH of 
the filename

and inode number.
(assuming of course we don't have cookies directly from the fiesystem 
itself.)

the current scheme of seeking to a buffer boundary and seeking forward in
the buffer looking for a matching entry would become quite robust 
using this scheme,

assuming that the lseek actually worked on that file system.




___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-27 Thread Julian Elischer

On 4/28/15 4:03 AM, Rick Macklem wrote:

Julian Elischer wrote:

On 4/25/15 4:28 AM, John Baldwin wrote:

On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote:

On 4/25/15 1:30 AM, Julian Elischer wrote:

On 4/24/15 10:59 PM, John Baldwin wrote:

Index: head/lib/libc/gen/telldir.c
===
--- head/lib/libc/gen/telldir.c (revision 281929)
+++ head/lib/libc/gen/telldir.c (working copy)
@@ -101,8 +101,10 @@
   return;
   if (lp->loc_loc == dirp->dd_loc && lp->loc_seek ==
dirp->dd_seek)
   return;
-   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
SEEK_SET);
-   dirp->dd_seek = lp->loc_seek;
+   if (lp->loc_seek != dirp->dd_seek) {
+   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
SEEK_SET);
+   dirp->dd_seek = lp->loc_seek;
+   }

yes I did that yesterday but it still fails when you transition
blocks.. (badly).

I also tried bigger blocks.. also fails (eventually)

I did find a way to make it work...  you had to seek back
to the first block you deleted on each set..
then work forward from there again..  unfortunately since
I'm trying to make a microsoft program not fail (via samba)
I have no control over how it does things and seekdir doesn't
know what was deleted anyway... (so the fix is fine for  the
test program but not for real life)

I think I can make the BSD one act like the linux one by changing
the lseek being done to use the offset (loc) plus the buffer seek
address of the target, instead of just going for the buffer base
and
stepping forward through the entries..

maybe tomorrow.


The following conditional code makes ours behave the same as the
linux
one.
it breaks several 'rules' but works where ours is clean but
fails..
as Rick said..  "maybe that's what we should do too."


this is at the end of seekdir()


The new code does what linux does.. and shouldn't work.. but does
   // at least in the limited conditions I need it to.
   // We'll probably need to do this at work...:


The original code is what we have now, but gets mightily confused
sometimes.
  // This is clean(er) but fails in specific
  situations(when
doing commands
  // from Microft windows, via samba).


root@vps1:/tmp # diff -u dir.c.orig dir.c
--- dir.c.orig2015-04-24 11:29:36.855317000 -0700
+++ dir.c2015-04-24 11:15:49.05850 -0700
@@ -1105,6 +1105,13 @@
dirp->dd_loc = lp->loc_loc;
return;
}
+#ifdef GLIBC_SEEK
+(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc,
SEEK_SET);
+dirp->dd_seek = lp->loc_seek + lp->loc_loc;
+dirp->dd_loc = 0;
+lp->loc_seek = dirp->dd_seek;
+lp->loc_loc = 0;
+#else
(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
dirp->dd_seek = lp->loc_seek;
dirp->dd_loc = 0;
@@ -1114,6 +1121,7 @@
if (dp == NULL)
break;
}
+#endif
}

Yes, this isn't at all safe.  There's no guarantee whatsoever that
the offset on the directory fd that isn't something returned by
getdirentries has any meaning.  In particular, the size of the
directory entry in a random filesystem might be a different size
than the structure returned by getdirentries (since it converts
things into a FS-independent format).

This might work for UFS by accident, but this is probably why ZFS
doesn't work.

However, this might be properly fixed by the thing that ino64 is
doing where each directory entry returned by getdirentries gives
you a seek offset that you _can_ directly seek to (as opposed to
seeking to the start of the block and then walking forward N
entries until you get an inter-block entry that is the same).

I just made the stunning discovery that our seekdir/readdir/telldir
code in libc works with
FreeBSD 8.0.
so maybe the problem is that the kernel changed it's behaviour, and
no-one thought to fix libc..

(at least it works on one of our 8.0 base appliances.. I'll do more
testing tomorrow.. it's past midnight.)


actually that was a mistake.. it fails on 8.0 as much as it fails on 
10.x and 11.

the patch above also fixes it on 8.0.  (UFS and ZFS)



I suspect that pre-r252438 systems work better for UFS than r252438
or later. That patch changed ufs_readdir() so that it no longer returned
the on-disk directory structure. (Among other things, it added code that
skipped over d_ino == 0 entries.)

yes but it was broken even before that.
basically here's the difference between what Linux (and mu patched 
code) does and what we do..


in Linux they seek directly to the base address of the block PLUS the 
offset of the entry being seeked.
The filesystem somehow correctly interprets this as teh start of the 
correct entry.
I guess it must return the correct information in the dirent structure 
in the first place.


in the unmodified BSD version, we seek to the start of the 'block' 
(actually the start of the buff

Re: readdir/telldir/seekdir problem (i think)

2015-04-27 Thread Rick Macklem
Julian Elischer wrote:
> On 4/25/15 4:28 AM, John Baldwin wrote:
> > On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote:
> >> On 4/25/15 1:30 AM, Julian Elischer wrote:
> >>> On 4/24/15 10:59 PM, John Baldwin wrote:
>  Index: head/lib/libc/gen/telldir.c
>  ===
>  --- head/lib/libc/gen/telldir.c (revision 281929)
>  +++ head/lib/libc/gen/telldir.c (working copy)
>  @@ -101,8 +101,10 @@
>    return;
>    if (lp->loc_loc == dirp->dd_loc && lp->loc_seek ==
>  dirp->dd_seek)
>    return;
>  -   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
>  SEEK_SET);
>  -   dirp->dd_seek = lp->loc_seek;
>  +   if (lp->loc_seek != dirp->dd_seek) {
>  +   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
>  SEEK_SET);
>  +   dirp->dd_seek = lp->loc_seek;
>  +   }
> >>> yes I did that yesterday but it still fails when you transition
> >>> blocks.. (badly).
> >>>
> >>> I also tried bigger blocks.. also fails (eventually)
> >>>
> >>> I did find a way to make it work...  you had to seek back
> >>> to the first block you deleted on each set..
> >>> then work forward from there again..  unfortunately since
> >>> I'm trying to make a microsoft program not fail (via samba)
> >>> I have no control over how it does things and seekdir doesn't
> >>> know what was deleted anyway... (so the fix is fine for  the
> >>> test program but not for real life)
> >>>
> >>> I think I can make the BSD one act like the linux one by changing
> >>> the lseek being done to use the offset (loc) plus the buffer seek
> >>> address of the target, instead of just going for the buffer base
> >>> and
> >>> stepping forward through the entries..
> >>>
> >>> maybe tomorrow.
> >>>
> >> The following conditional code makes ours behave the same as the
> >> linux
> >> one.
> >> it breaks several 'rules' but works where ours is clean but
> >> fails..
> >> as Rick said..  "maybe that's what we should do too."
> >>
> >>
> >> this is at the end of seekdir()
> >>
> >>
> >> The new code does what linux does.. and shouldn't work.. but does
> >>   // at least in the limited conditions I need it to.
> >>   // We'll probably need to do this at work...:
> >>
> >>
> >> The original code is what we have now, but gets mightily confused
> >> sometimes.
> >>  // This is clean(er) but fails in specific
> >>  situations(when
> >> doing commands
> >>  // from Microft windows, via samba).
> >>
> >>
> >> root@vps1:/tmp # diff -u dir.c.orig dir.c
> >> --- dir.c.orig2015-04-24 11:29:36.855317000 -0700
> >> +++ dir.c2015-04-24 11:15:49.05850 -0700
> >> @@ -1105,6 +1105,13 @@
> >>dirp->dd_loc = lp->loc_loc;
> >>return;
> >>}
> >> +#ifdef GLIBC_SEEK
> >> +(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc,
> >> SEEK_SET);
> >> +dirp->dd_seek = lp->loc_seek + lp->loc_loc;
> >> +dirp->dd_loc = 0;
> >> +lp->loc_seek = dirp->dd_seek;
> >> +lp->loc_loc = 0;
> >> +#else
> >>(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
> >>dirp->dd_seek = lp->loc_seek;
> >>dirp->dd_loc = 0;
> >> @@ -1114,6 +1121,7 @@
> >>if (dp == NULL)
> >>break;
> >>}
> >> +#endif
> >>}
> > Yes, this isn't at all safe.  There's no guarantee whatsoever that
> > the offset on the directory fd that isn't something returned by
> > getdirentries has any meaning.  In particular, the size of the
> > directory entry in a random filesystem might be a different size
> > than the structure returned by getdirentries (since it converts
> > things into a FS-independent format).
> >
> > This might work for UFS by accident, but this is probably why ZFS
> > doesn't work.
> >
> > However, this might be properly fixed by the thing that ino64 is
> > doing where each directory entry returned by getdirentries gives
> > you a seek offset that you _can_ directly seek to (as opposed to
> > seeking to the start of the block and then walking forward N
> > entries until you get an inter-block entry that is the same).
> I just made the stunning discovery that our seekdir/readdir/telldir
> code in libc works with
> FreeBSD 8.0.
> so maybe the problem is that the kernel changed it's behaviour, and
> no-one thought to fix libc..
> 
> (at least it works on one of our 8.0 base appliances.. I'll do more
> testing tomorrow.. it's past midnight.)
> 
I suspect that pre-r252438 systems work better for UFS than r252438
or later. That patch changed ufs_readdir() so that it no longer returned
the on-disk directory structure. (Among other things, it added code that
skipped over d_ino == 0 entries.)

As such, r252438 and later systems have UFS where the "logical" offset
of a directory entry returned by getdirentries() isn't the same as the
"physical" offset for it

Re: readdir/telldir/seekdir problem (i think)

2015-04-27 Thread Julian Elischer

On 4/25/15 4:28 AM, John Baldwin wrote:

On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote:

On 4/25/15 1:30 AM, Julian Elischer wrote:

On 4/24/15 10:59 PM, John Baldwin wrote:

Index: head/lib/libc/gen/telldir.c
===
--- head/lib/libc/gen/telldir.c (revision 281929)
+++ head/lib/libc/gen/telldir.c (working copy)
@@ -101,8 +101,10 @@
  return;
  if (lp->loc_loc == dirp->dd_loc && lp->loc_seek ==
dirp->dd_seek)
  return;
-   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
-   dirp->dd_seek = lp->loc_seek;
+   if (lp->loc_seek != dirp->dd_seek) {
+   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
SEEK_SET);
+   dirp->dd_seek = lp->loc_seek;
+   }

yes I did that yesterday but it still fails when you transition
blocks.. (badly).

I also tried bigger blocks.. also fails (eventually)

I did find a way to make it work...  you had to seek back
to the first block you deleted on each set..
then work forward from there again..  unfortunately since
I'm trying to make a microsoft program not fail (via samba)
I have no control over how it does things and seekdir doesn't
know what was deleted anyway... (so the fix is fine for  the
test program but not for real life)

I think I can make the BSD one act like the linux one by changing
the lseek being done to use the offset (loc) plus the buffer seek
address of the target, instead of just going for the buffer base and
stepping forward through the entries..

maybe tomorrow.


The following conditional code makes ours behave the same as the linux
one.
it breaks several 'rules' but works where ours is clean but fails..
as Rick said..  "maybe that's what we should do too."


this is at the end of seekdir()


The new code does what linux does.. and shouldn't work.. but does
  // at least in the limited conditions I need it to.
  // We'll probably need to do this at work...:


The original code is what we have now, but gets mightily confused
sometimes.
 // This is clean(er) but fails in specific situations(when
doing commands
 // from Microft windows, via samba).


root@vps1:/tmp # diff -u dir.c.orig dir.c
--- dir.c.orig2015-04-24 11:29:36.855317000 -0700
+++ dir.c2015-04-24 11:15:49.05850 -0700
@@ -1105,6 +1105,13 @@
   dirp->dd_loc = lp->loc_loc;
   return;
   }
+#ifdef GLIBC_SEEK
+(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc,
SEEK_SET);
+dirp->dd_seek = lp->loc_seek + lp->loc_loc;
+dirp->dd_loc = 0;
+lp->loc_seek = dirp->dd_seek;
+lp->loc_loc = 0;
+#else
   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
   dirp->dd_seek = lp->loc_seek;
   dirp->dd_loc = 0;
@@ -1114,6 +1121,7 @@
   if (dp == NULL)
   break;
   }
+#endif
   }

Yes, this isn't at all safe.  There's no guarantee whatsoever that
the offset on the directory fd that isn't something returned by
getdirentries has any meaning.  In particular, the size of the
directory entry in a random filesystem might be a different size
than the structure returned by getdirentries (since it converts
things into a FS-independent format).

This might work for UFS by accident, but this is probably why ZFS
doesn't work.

However, this might be properly fixed by the thing that ino64 is
doing where each directory entry returned by getdirentries gives
you a seek offset that you _can_ directly seek to (as opposed to
seeking to the start of the block and then walking forward N
entries until you get an inter-block entry that is the same).
I just made the stunning discovery that our seekdir/readdir/telldir 
code in libc works with

FreeBSD 8.0.
so maybe the problem is that the kernel changed it's behaviour, and 
no-one thought to fix libc..


(at least it works on one of our 8.0 base appliances.. I'll do more 
testing tomorrow.. it's past midnight.)








___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-25 Thread Rick Macklem
Julian Elischer wrote:
> On 4/25/15 9:39 AM, Rick Macklem wrote:
> > Jilles Tjoelker wrote:
> >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >>> Yes, this isn't at all safe.  There's no guarantee whatsoever
> >>> that
> >>> the offset on the directory fd that isn't something returned by
> >>> getdirentries has any meaning.  In particular, the size of the
> >>> directory entry in a random filesystem might be a different size
> >>> than the structure returned by getdirentries (since it converts
> >>> things into a FS-independent format).
> >>> This might work for UFS by accident, but this is probably why ZFS
> >>> doesn't work.
> >>> However, this might be properly fixed by the thing that ino64 is
> >>> doing where each directory entry returned by getdirentries gives
> >>> you a seek offset that you _can_ directly seek to (as opposed to
> >>> seeking to the start of the block and then walking forward N
> >>> entries until you get an inter-block entry that is the same).
> >> The ino64 branch only reserves space for d_off and does not use it
> >> in
> >> any way. This is appropriate since actually using d_off is a major
> >> feature addition.
> >>
> > Well, at some point ino64 will need to define a new
> > getdirentries(2)
> > syscall and I believe this new syscall can have
> > different/additional
> > arguments.
> yes, posix only specifies 2 mandatory fields (d_ino and d_name) and
> everything else is implementation dependent.
> > I'd suggest that the new gtedirentries(2) syscall should return a
> > flag to indicate that the underlying file system is filling in
> > d_off.
> > Then the libc functions can use d_off if it it available.
> > (They will still need to "work" at least as well as they do now if
> >   the file system doesn't support d_off. The old getdirentries(2)
> >   syscall
> >   will be returning the old/current "struct dirent" which doesn't
> >   have
> >   the field anyhow.)
> >
> > Another bit of fun is that the argument for seekdir()/telldir() is
> > a
> > long and ends up 32bits for some arches. d_off is 64bits, since
> > that
> > is what some file systems require.
> what does linux use?
Btw, I found this:
  https://bugs.centos.org/view.php?id=5496
It appears that Linux has been having fun with this too, at least for NFS.

I still think that reading the whole directory is the only way to fix NFS.
(Unfortunately, they don't say how the Linux distros fixed it.;-)

Have fun with it, rick

> --
>In glibc up to version 2.1.1, the return type of telldir() was
> off_t.
> POSIX.1-2001 specifies long, and this is the type used since
> glibc
> 2.1.2.
> 
> also from the linux man page: this is interesting..
> 
> 
> In early filesystems, the value returned by telldir() was a
> simple
> file offset within a directory.  Modern filesystems use tree
> or hash
> structures, rather than flat tables, to represent
> directories.  On
> such filesystems, the value returned by telldir() (and used
> internally by readdir(3)) is a "cookie" that is used by the
> implementation to derive a position within a directory.
> Application
> programs should treat this strictly as an opaque value,
> making no
> assumptions about its contents.
> --
> but glibc uses the contents in a nonopaque (and possibly wrong) way
> itself in seekdir. .
> (not following their own advice.)
> 
> 
> > Maybe the library code can only use d_off if it is a 64bit arch and
> > the file system is filling it in. (Or maybe the library can keep
> > track
> > of 32<->64bit mappings for the offsets. I haven't looked at the
> > libc
> > functions for a while, so I can't remember what they keep track
> > of.)
> 
> one supposes a 32 bit system would not have such large file systems
> on
> it..
> (maybe?)
> >
> > rick
> >
> >> A proper d_off would still be useful even if UFS's readdir keeps
> >> masking
> >> off the offset so a directory read always starts at the beginning
> >> of
> >> a
> >> 512-byte directory block, since this allows more distinct offset
> >> values
> >> than safely using getdirentries()'s *basep. With d_off, one outer
> >> loop
> >> must read at least one directory block to avoid spinning
> >> indefinitely,
> >> while using getdirentries()'s *basep requires reading the whole
> >> getdirentries() buffer.
> >>
> >> Some Linux filesystems go further and provide a unique d_off for
> >> each
> >> entry.
> >>
> >> Another idea would be to store the last d_ino instead of dd_loc
> >> into
> >> the
> >> struct ddloc. On seekdir(), this would seek to loc_seek as before
> >> and
> >> skip entries until that d_ino is found, or to the start of the
> >> buffer
> >> if
> >> not found (and possibly return some entries again that should not
> >> be
> >> returned, but Samba copes with that).
> >>
> >> --
> >> Jilles Tjoelker
> >> ___
> >> freebsd-current@freebsd

Re: readdir/telldir/seekdir problem (i think)

2015-04-25 Thread Rick Macklem
Julian Elischer wrote:
> On 4/25/15 9:39 AM, Rick Macklem wrote:
> > Jilles Tjoelker wrote:
> >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >>> Yes, this isn't at all safe.  There's no guarantee whatsoever
> >>> that
> >>> the offset on the directory fd that isn't something returned by
> >>> getdirentries has any meaning.  In particular, the size of the
> >>> directory entry in a random filesystem might be a different size
> >>> than the structure returned by getdirentries (since it converts
> >>> things into a FS-independent format).
> >>> This might work for UFS by accident, but this is probably why ZFS
> >>> doesn't work.
> >>> However, this might be properly fixed by the thing that ino64 is
> >>> doing where each directory entry returned by getdirentries gives
> >>> you a seek offset that you _can_ directly seek to (as opposed to
> >>> seeking to the start of the block and then walking forward N
> >>> entries until you get an inter-block entry that is the same).
> >> The ino64 branch only reserves space for d_off and does not use it
> >> in
> >> any way. This is appropriate since actually using d_off is a major
> >> feature addition.
> >>
> > Well, at some point ino64 will need to define a new
> > getdirentries(2)
> > syscall and I believe this new syscall can have
> > different/additional
> > arguments.
> yes, posix only specifies 2 mandatory fields (d_ino and d_name) and
> everything else is implementation dependent.
> > I'd suggest that the new gtedirentries(2) syscall should return a
> > flag to indicate that the underlying file system is filling in
> > d_off.
> > Then the libc functions can use d_off if it it available.
> > (They will still need to "work" at least as well as they do now if
> >   the file system doesn't support d_off. The old getdirentries(2)
> >   syscall
> >   will be returning the old/current "struct dirent" which doesn't
> >   have
> >   the field anyhow.)
> >
> > Another bit of fun is that the argument for seekdir()/telldir() is
> > a
> > long and ends up 32bits for some arches. d_off is 64bits, since
> > that
> > is what some file systems require.
Now that I've taken a look at the libc code, I can see that loc_XX
structures handle the telldir/seekdir stuff (mapping the seek position
plus offset within a block to an index).
I think d_off could be saved there as well (if the file system provides
it) and the new getdirentries() could take a "physical offset" argument
where this d_off could be passed down to the file system.

I think this would allow the problem to be handled properly "in general".

It still doesn't quite solve the NFS case.
- For NFS, a server is supposed to have a "directory_cookie_verifier", which
  allows the server to decide if a "directory offset cookie" is still valid.
  --> This has never been implemented for the FreeBSD server and it isn't easy.
  In part, real NFS servers don't do what the RFC describes, at least not 
in an
  obvious way. They "understand" if offset cookies still work after removal
  of an entry and also must store this "directory_offset_verifier" in the
  file's attributes. FreeBSD just never checks this verifier.

rick

> what does linux use?
> --
>In glibc up to version 2.1.1, the return type of telldir() was
> off_t.
> POSIX.1-2001 specifies long, and this is the type used since
> glibc
> 2.1.2.
> 
> also from the linux man page: this is interesting..
> 
> 
> In early filesystems, the value returned by telldir() was a
> simple
> file offset within a directory.  Modern filesystems use tree
> or hash
> structures, rather than flat tables, to represent
> directories.  On
> such filesystems, the value returned by telldir() (and used
> internally by readdir(3)) is a "cookie" that is used by the
> implementation to derive a position within a directory.
> Application
> programs should treat this strictly as an opaque value,
> making no
> assumptions about its contents.
> --
> but glibc uses the contents in a nonopaque (and possibly wrong) way
> itself in seekdir. .
> (not following their own advice.)
> 
> 
> > Maybe the library code can only use d_off if it is a 64bit arch and
> > the file system is filling it in. (Or maybe the library can keep
> > track
> > of 32<->64bit mappings for the offsets. I haven't looked at the
> > libc
> > functions for a while, so I can't remember what they keep track
> > of.)
> 
> one supposes a 32 bit system would not have such large file systems
> on
> it..
> (maybe?)
> >
> > rick
> >
> >> A proper d_off would still be useful even if UFS's readdir keeps
> >> masking
> >> off the offset so a directory read always starts at the beginning
> >> of
> >> a
> >> 512-byte directory block, since this allows more distinct offset
> >> values
> >> than safely using getdirentries()'s *basep. With d_off, one outer
> >> loop
> >> must read at least one 

Re: readdir/telldir/seekdir problem (i think)

2015-04-25 Thread Rick Macklem
Julian Elischer wrote:
> On 4/25/15 5:52 AM, Jilles Tjoelker wrote:
> > On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >> Yes, this isn't at all safe.  There's no guarantee whatsoever that
> >> the offset on the directory fd that isn't something returned by
> >> getdirentries has any meaning.  In particular, the size of the
> >> directory entry in a random filesystem might be a different size
> >> than the structure returned by getdirentries (since it converts
> >> things into a FS-independent format).
> >> This might work for UFS by accident, but this is probably why ZFS
> >> doesn't work.
I took a quick look at UFS, ZFS and NFS:
- UFS appears to use the offset as a block + offset within
  the block, so it is understandable that it works for the simple
  "seek to byte offset" case.
  - One oddity here is that the loop that copies out directory entries
(about line#2188 in ufs_vnops.c) skips over d_ino == 0. Then uiomove()
copies the data out and offset is updated.
--> Seems to me that uio_offset "skips" entries with d_ino == 0, so
it isn't the actual byte offset in the on-disk directory.
*** Somehow, this seems it would break seekdir/telldir at some
point? (This handling of uio_offset seems just plain broken to me?)
- ZFS appears to number directory entries 1, 2, 3, 4, ...
  But, except for 1, 2, 3 (which are somewhat special, indicating ".",
  ".." and one other I've already forgotten, maybe ".zfs"), the rest
  are "serialized", which seems to mean that the byte offset is kept in
  a list along with some hash function used to find them.
  --> I didn't really follow the code, but it appears to "offset >> 28",
  so I don't think the low order 28bits are used to get the
  directory block.
  As such, I'm not too surprised that adding in the byte position within
  the block when doing lseek() doesn't break it.
  (I didn't really follow the stuff in zap_micro.c, so I could be all
   wrong here.)
  - I don't know if removal of an entry changes these indices 3, 4, ...?
- NFS gets the block and mostly ignores the byte position within the
  block. (ie. It does "uio_offset / NFS_DIRBLKSIZ" and then uses that
  to get a buffer cache block. This will, in turn, look up this value
  to find a cached cookie.)
  --> The only effect of "uio_offset % NFS_DIRBLKSIZ" != 0 is that it
  will return data from this logical position within a block for
  the readdir.
  --> If a Readdir RPC is done for the cached cookie after a Remove,
  it is up to the NFS server's implementation w.r.t. what you'll
  get back.

Not sure if the above is actually useful, at least until all file
systems are evaluated, rick.

> >> However, this might be properly fixed by the thing that ino64 is
> >> doing where each directory entry returned by getdirentries gives
> >> you a seek offset that you _can_ directly seek to (as opposed to
> >> seeking to the start of the block and then walking forward N
> >> entries until you get an inter-block entry that is the same).
> > The ino64 branch only reserves space for d_off and does not use it
> > in
> > any way. This is appropriate since actually using d_off is a major
> > feature addition.
> >
> > A proper d_off would still be useful even if UFS's readdir keeps
> > masking
> > off the offset so a directory read always starts at the beginning
> > of a
> > 512-byte directory block, since this allows more distinct offset
> > values
> > than safely using getdirentries()'s *basep. With d_off, one outer
> > loop
> > must read at least one directory block to avoid spinning
> > indefinitely,
> > while using getdirentries()'s *basep requires reading the whole
> > getdirentries() buffer.
> >
> > Some Linux filesystems go further and provide a unique d_off for
> > each
> > entry.
> >
> > Another idea would be to store the last d_ino instead of dd_loc
> > into the
> > struct ddloc. On seekdir(), this would seek to loc_seek as before
> > and
> > skip entries until that d_ino is found, or to the start of the
> > buffer if
> > not found (and possibly return some entries again that should not
> > be
> > returned, but Samba copes with that).
> 
> yes.. though maybe a hash of hte inode number and the name may be a
> better thing to search on..
> 
> I need to check on your statement about samba to see if its true for
> 3.6..
> 
> 
> >
> 
> ___
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscr...@freebsd.org"
> 
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-25 Thread Rick Macklem
Julian Elischer wrote:
> On 4/25/15 4:28 AM, John Baldwin wrote:
> > On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote:
> >> On 4/25/15 1:30 AM, Julian Elischer wrote:
> >>> On 4/24/15 10:59 PM, John Baldwin wrote:
>  On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote:
> > On 4/23/15 9:54 PM, John Baldwin wrote:
> >> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> >>> On 4/23/15 11:20 AM, Julian Elischer wrote:
>  I'm debugging a problem being seen with samba 3.6.
> 
>  basically  telldir/seekdir/readdir don't seem to work as
>  advertised..
> >>> ok so it looks like readdir() (and friends) is totally broken
> >>> in
> >>> the face
> >>> of deletes unless you read the entire directory at once or
> >>> reset
> >>> to the
> >>> the first file before the deletes, or earlier.
> >> I'm not sure that Samba isn't assuming non-portable behavior.
> >> For example:
> >>
> >> >From
> >> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html
> >>
> >>
> >> If a file is removed from or added to the directory after the
> >> most recent call
> >> to opendir() or rewinddir(), whether a subsequent call to
> >> readdir() returns an
> >> entry for that file is unspecified.
> >>
> >> While this doesn't speak directly to your case, it does note
> >> that
> >> you will
> >> get inconsistencies if you scan a directory concurrent with
> >> add
> >> and remove.
> >>
> >> UFS might kind of work actually since deletes do not compact
> >> the
> >> backing
> >> directory, but I suspect NFS and ZFS would not work.  In
> >> addition, our
> >> current NFS support for seekdir is pretty flaky and can't be
> >> fixed without
> >> changes to return the seek offset for each directory entry (I
> >> believe that
> >> the projects/ino64 patches include this since they are
> >> breaking
> >> the ABI of
> >> the relevant structures already).  The ABI breakage makes this
> >> a
> >> very
> >> non-trivial task.  However, even if you have that per-item
> >> cookie, it is
> >> likely meaningless in the face of filesystems that use any
> >> sort
> >> of more
> >> advanced structure than an array (such as trees, etc.) to
> >> store
> >> directory
> >> entries.  POSIX specifically mentions this in the rationale
> >> for
> >> seekdir:
> >>
> >> One of the perceived problems of implementation is that
> >> returning
> >> to a given point in a directory is quite difficult to describe
> >> formally, in spite of its intuitive appeal, when systems that
> >> use
> >> B-trees, hashing functions, or other similar mechanisms to
> >> order
> >> their directories are considered. The definition of seekdir()
> >> and
> >> telldir() does not specify whether, when using these
> >> interfaces,
> >> a given directory entry will be seen at all, or more than
> >> once.
> >>
> >> In fact, given that quote, I would argue that what Samba is
> >> doing is
> >> non-portable.  This would seem to indicate that a conforming
> >> seekdir could
> >> just change readdir to immediately return EOF until you call
> >> rewinddir.
> > how does readdir know that the directory has been changed?
> > fstat?
>  Undefined.  FreeBSD's libc doesn't cache the entire directory
>  (unless you
>  are using a union mount), instead it just caches the most recent
>  call to
>  getdirentries().  It then serves up entries from that block
>  until
>  you hit
>  the end.  When you hit the end it reads the next block, etc.
>  When you
>  call telldir(), the kernel saves the seek offset from the start
>  of the
>  current block and the offset within that block and returns a
>  cookie to
>  you.  seekdir() looks up the cookie to find the (seek offset,
>  block
>  offset)
>  tuple.  If the location matches the directory's current location
>  it
>  doesn't
>  do anything, otherwise it seeks to the seek offset and reads a
>  new
>  block via
>  getdirentries().  There is no check for seeing if a directory is
>  changed.  Instead, you can only be stale by one "block".  When
>  you
>  read
>  a new block it is relative to the directory's state at that
>  time.
> 
>  Rick's suggestion of reusing the block for a seek within a block
>  would be
>  fairly easy to implement, I think it would just be:
> 
>  Index: head/lib/libc/gen/telldir.c
>  ===
>  --- head/lib/libc/gen/telldir.c (revision 281929)
>  +++ head/lib/libc/gen/telldir.c (working copy)
>  @@ -101,8 +101,10 @@
>    return;
> 

Re: readdir/telldir/seekdir problem (i think)

2015-04-25 Thread Rick Macklem
Jilles Toelker wrote:
> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> > Yes, this isn't at all safe.  There's no guarantee whatsoever that
> > the offset on the directory fd that isn't something returned by
> > getdirentries has any meaning.  In particular, the size of the
> > directory entry in a random filesystem might be a different size
> > than the structure returned by getdirentries (since it converts
> > things into a FS-independent format).
> 
> > This might work for UFS by accident, but this is probably why ZFS
> > doesn't work.
> 
> > However, this might be properly fixed by the thing that ino64 is
> > doing where each directory entry returned by getdirentries gives
> > you a seek offset that you _can_ directly seek to (as opposed to
> > seeking to the start of the block and then walking forward N
> > entries until you get an inter-block entry that is the same).
> 
> The ino64 branch only reserves space for d_off and does not use it in
> any way. This is appropriate since actually using d_off is a major
> feature addition.
> 
Just fyi, I did a patch:
  http://people.freebsd.org/~rmacklem/64bitfileno.patch
which does fill this field in for the various file systems and copies
the new "struct dirent" to "struct dirent32", so it works with an
unchanged userland.
Since the ino64 folks felt this should come after the ino64 change,
I didn't pursue committing it.
To commit it to head, it would have to be modified slightly, so that
the new structure is called "struct dirent64", so it doesn't screw
up userland builds using "struct dirent", but I think that is the
only change required for it to go into head/current if people felt
that was appropriate. (Oh, and it called d_off "d_cookie", although
I have no strong preference w.r.t. what it is called.)

If you look at it, you'll see most of the changes are for NFS,
but the rest were pretty trivial.

rick

> A proper d_off would still be useful even if UFS's readdir keeps
> masking
> off the offset so a directory read always starts at the beginning of
> a
> 512-byte directory block, since this allows more distinct offset
> values
> than safely using getdirentries()'s *basep. With d_off, one outer
> loop
> must read at least one directory block to avoid spinning
> indefinitely,
> while using getdirentries()'s *basep requires reading the whole
> getdirentries() buffer.
> 
> Some Linux filesystems go further and provide a unique d_off for each
> entry.
> 
> Another idea would be to store the last d_ino instead of dd_loc into
> the
> struct ddloc. On seekdir(), this would seek to loc_seek as before and
> skip entries until that d_ino is found, or to the start of the buffer
> if
> not found (and possibly return some entries again that should not be
> returned, but Samba copes with that).
> 
> --
> Jilles Tjoelker
> ___
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscr...@freebsd.org"
> 
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-25 Thread Rick Macklem
Julian Elischer wrote:
> On 4/25/15 9:39 AM, Rick Macklem wrote:
> > Jilles Tjoelker wrote:
> >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >>> Yes, this isn't at all safe.  There's no guarantee whatsoever
> >>> that
> >>> the offset on the directory fd that isn't something returned by
> >>> getdirentries has any meaning.  In particular, the size of the
> >>> directory entry in a random filesystem might be a different size
> >>> than the structure returned by getdirentries (since it converts
> >>> things into a FS-independent format).
> >>> This might work for UFS by accident, but this is probably why ZFS
> >>> doesn't work.
> >>> However, this might be properly fixed by the thing that ino64 is
> >>> doing where each directory entry returned by getdirentries gives
> >>> you a seek offset that you _can_ directly seek to (as opposed to
> >>> seeking to the start of the block and then walking forward N
> >>> entries until you get an inter-block entry that is the same).
> >> The ino64 branch only reserves space for d_off and does not use it
> >> in
> >> any way. This is appropriate since actually using d_off is a major
> >> feature addition.
> >>
> > Well, at some point ino64 will need to define a new
> > getdirentries(2)
> > syscall and I believe this new syscall can have
> > different/additional
> > arguments.
> yes, posix only specifies 2 mandatory fields (d_ino and d_name) and
> everything else is implementation dependent.
> > I'd suggest that the new gtedirentries(2) syscall should return a
> > flag to indicate that the underlying file system is filling in
> > d_off.
> > Then the libc functions can use d_off if it it available.
> > (They will still need to "work" at least as well as they do now if
> >   the file system doesn't support d_off. The old getdirentries(2)
> >   syscall
> >   will be returning the old/current "struct dirent" which doesn't
> >   have
> >   the field anyhow.)
> >
> > Another bit of fun is that the argument for seekdir()/telldir() is
> > a
> > long and ends up 32bits for some arches. d_off is 64bits, since
> > that
> > is what some file systems require.
> what does linux use?
> --
>In glibc up to version 2.1.1, the return type of telldir() was
> off_t.
> POSIX.1-2001 specifies long, and this is the type used since
> glibc
> 2.1.2.
> 
> also from the linux man page: this is interesting..
> 
> 
> In early filesystems, the value returned by telldir() was a
> simple
> file offset within a directory.  Modern filesystems use tree
> or hash
> structures, rather than flat tables, to represent
> directories.  On
> such filesystems, the value returned by telldir() (and used
> internally by readdir(3)) is a "cookie" that is used by the
> implementation to derive a position within a directory.
> Application
> programs should treat this strictly as an opaque value,
> making no
> assumptions about its contents.
> --
> but glibc uses the contents in a nonopaque (and possibly wrong) way
> itself in seekdir. .
> (not following their own advice.)
> 
I believe that most of the FreeBSD file systems except UFS and ZFS just
copy the fields of their internal directory structure to fields in
"struct dirent", filling blocks sequentially. (Actually, I only took a
quick look, but ZFS might also be this way.)
As such, the "offsets" for FreeBSD are byte offsets into these "logical 
directory"
blocks.
The problem is (as already discussed) that there is no way to predict
how these will change for a given file system when entries are removed/added.
(I think the only way to "know" what the modified "logical directory" looks
 like is to read it again from the beginning, so that all the directory entries
 go through the conversion to logical again.)
UFS and the NFS client ensure that no "struct dirent" crosses a 512byte block
boundary. I don't think the other file systems do this. I mention this, since
the libc functions can't assume the UFS behaviour for this.
(At one time, UFS just "consumed" removed entries into the preceding "struct 
direct"
 or set it invalid, if it was the first entry in a 512byte block. This implied 
that
 the byte offsets (logical == physical) didn't change for subsequent entries 
upon
 a removal. It sounds like UFS is no longer doing this, from one of your posts?)

I am curious to see what glibc does, since I had assumed it just read the
entire directory at opendir/first-readdir.

rick
ps: This is what I recall from fooling with "struct dirent" a few months ago
and I'm getting old, so it may all be wrong;-)
> 
> > Maybe the library code can only use d_off if it is a 64bit arch and
> > the file system is filling it in. (Or maybe the library can keep
> > track
> > of 32<->64bit mappings for the offsets. I haven't looked at the
> > libc
> > functions for a while, so I can't remember what they keep track
> > of.)
> 
> one supposes a 32 

Re: readdir/telldir/seekdir problem (i think)

2015-04-25 Thread Rick Macklem
Julian Elischer wrote:
> On 4/25/15 9:39 AM, Rick Macklem wrote:
> > Jilles Tjoelker wrote:
> >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >>> Yes, this isn't at all safe.  There's no guarantee whatsoever
> >>> that
> >>> the offset on the directory fd that isn't something returned by
> >>> getdirentries has any meaning.  In particular, the size of the
> >>> directory entry in a random filesystem might be a different size
> >>> than the structure returned by getdirentries (since it converts
> >>> things into a FS-independent format).
> >>> This might work for UFS by accident, but this is probably why ZFS
> >>> doesn't work.
> >>> However, this might be properly fixed by the thing that ino64 is
> >>> doing where each directory entry returned by getdirentries gives
> >>> you a seek offset that you _can_ directly seek to (as opposed to
> >>> seeking to the start of the block and then walking forward N
> >>> entries until you get an inter-block entry that is the same).
> >> The ino64 branch only reserves space for d_off and does not use it
> >> in
> >> any way. This is appropriate since actually using d_off is a major
> >> feature addition.
> >>
> > Well, at some point ino64 will need to define a new
> > getdirentries(2)
> > syscall and I believe this new syscall can have
> > different/additional
> > arguments.
> yes, posix only specifies 2 mandatory fields (d_ino and d_name) and
> everything else is implementation dependent.
> > I'd suggest that the new gtedirentries(2) syscall should return a
> > flag to indicate that the underlying file system is filling in
> > d_off.
> > Then the libc functions can use d_off if it it available.
> > (They will still need to "work" at least as well as they do now if
> >   the file system doesn't support d_off. The old getdirentries(2)
> >   syscall
> >   will be returning the old/current "struct dirent" which doesn't
> >   have
> >   the field anyhow.)
> >
> > Another bit of fun is that the argument for seekdir()/telldir() is
> > a
> > long and ends up 32bits for some arches. d_off is 64bits, since
> > that
> > is what some file systems require.
> what does linux use?
> --
>In glibc up to version 2.1.1, the return type of telldir() was
> off_t.
> POSIX.1-2001 specifies long, and this is the type used since
> glibc
> 2.1.2.
> 
> also from the linux man page: this is interesting..
> 
> 
> In early filesystems, the value returned by telldir() was a
> simple
> file offset within a directory.  Modern filesystems use tree
> or hash
> structures, rather than flat tables, to represent
> directories.  On
> such filesystems, the value returned by telldir() (and used
> internally by readdir(3)) is a "cookie" that is used by the
> implementation to derive a position within a directory.
> Application
> programs should treat this strictly as an opaque value,
> making no
> assumptions about its contents.
> --
> but glibc uses the contents in a nonopaque (and possibly wrong) way
> itself in seekdir. .
> (not following their own advice.)
> 
> 
> > Maybe the library code can only use d_off if it is a 64bit arch and
> > the file system is filling it in. (Or maybe the library can keep
> > track
> > of 32<->64bit mappings for the offsets. I haven't looked at the
> > libc
> > functions for a while, so I can't remember what they keep track
> > of.)
> 
> one supposes a 32 bit system would not have such large file systems
> on
> it..
> (maybe?)
For NFS, the cookie is always an opaque 64bits. These cookies are cached
in the kernel by the client, with one for each "logical UFS-like directory
block generated for getdirentries(2)". As such, the NFS case does a 64bit->32bit
mapping. (Because of endianness etc, there is no guarantee that most of these
cookies are 0 for the high order 32bits.)

I need to look at the library code (both glibc and ours) before I understand
this better and can say more.

Have fun with it, rick
ps: The ino64 stuff will never be MFC'd, so it would be nice to "improve"
what the libc functions do without use of d_off.

> >
> > rick
> >
> >> A proper d_off would still be useful even if UFS's readdir keeps
> >> masking
> >> off the offset so a directory read always starts at the beginning
> >> of
> >> a
> >> 512-byte directory block, since this allows more distinct offset
> >> values
> >> than safely using getdirentries()'s *basep. With d_off, one outer
> >> loop
> >> must read at least one directory block to avoid spinning
> >> indefinitely,
> >> while using getdirentries()'s *basep requires reading the whole
> >> getdirentries() buffer.
> >>
> >> Some Linux filesystems go further and provide a unique d_off for
> >> each
> >> entry.
> >>
> >> Another idea would be to store the last d_ino instead of dd_loc
> >> into
> >> the
> >> struct ddloc. On seekdir(), this would seek to loc_seek as before
> >> and
> >> skip entries un

Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/25/15 9:39 AM, Rick Macklem wrote:

Jilles Tjoelker wrote:

On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:

Yes, this isn't at all safe.  There's no guarantee whatsoever that
the offset on the directory fd that isn't something returned by
getdirentries has any meaning.  In particular, the size of the
directory entry in a random filesystem might be a different size
than the structure returned by getdirentries (since it converts
things into a FS-independent format).
This might work for UFS by accident, but this is probably why ZFS
doesn't work.
However, this might be properly fixed by the thing that ino64 is
doing where each directory entry returned by getdirentries gives
you a seek offset that you _can_ directly seek to (as opposed to
seeking to the start of the block and then walking forward N
entries until you get an inter-block entry that is the same).

The ino64 branch only reserves space for d_off and does not use it in
any way. This is appropriate since actually using d_off is a major
feature addition.


Well, at some point ino64 will need to define a new getdirentries(2)
syscall and I believe this new syscall can have different/additional
arguments.

yes, posix only specifies 2 mandatory fields (d_ino and d_name) and
everything else is implementation dependent.

I'd suggest that the new gtedirentries(2) syscall should return a
flag to indicate that the underlying file system is filling in d_off.
Then the libc functions can use d_off if it it available.
(They will still need to "work" at least as well as they do now if
  the file system doesn't support d_off. The old getdirentries(2) syscall
  will be returning the old/current "struct dirent" which doesn't have
  the field anyhow.)

Another bit of fun is that the argument for seekdir()/telldir() is a
long and ends up 32bits for some arches. d_off is 64bits, since that
is what some file systems require.

what does linux use?
--
  In glibc up to version 2.1.1, the return type of telldir() was 
off_t.

   POSIX.1-2001 specifies long, and this is the type used since glibc
   2.1.2.

also from the linux man page: this is interesting..


   In early filesystems, the value returned by telldir() was a simple
   file offset within a directory.  Modern filesystems use tree 
or hash

   structures, rather than flat tables, to represent directories.  On
   such filesystems, the value returned by telldir() (and used
   internally by readdir(3)) is a "cookie" that is used by the
   implementation to derive a position within a directory. 
Application

   programs should treat this strictly as an opaque value, making no
   assumptions about its contents.
--
but glibc uses the contents in a nonopaque (and possibly wrong) way 
itself in seekdir. .

(not following their own advice.)



Maybe the library code can only use d_off if it is a 64bit arch and
the file system is filling it in. (Or maybe the library can keep track
of 32<->64bit mappings for the offsets. I haven't looked at the libc
functions for a while, so I can't remember what they keep track of.)


one supposes a 32 bit system would not have such large file systems on 
it..

(maybe?)


rick


A proper d_off would still be useful even if UFS's readdir keeps
masking
off the offset so a directory read always starts at the beginning of
a
512-byte directory block, since this allows more distinct offset
values
than safely using getdirentries()'s *basep. With d_off, one outer
loop
must read at least one directory block to avoid spinning
indefinitely,
while using getdirentries()'s *basep requires reading the whole
getdirentries() buffer.

Some Linux filesystems go further and provide a unique d_off for each
entry.

Another idea would be to store the last d_ino instead of dd_loc into
the
struct ddloc. On seekdir(), this would seek to loc_seek as before and
skip entries until that d_ino is found, or to the start of the buffer
if
not found (and possibly return some entries again that should not be
returned, but Samba copes with that).

--
Jilles Tjoelker
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to
"freebsd-current-unsubscr...@freebsd.org"





___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/25/15 5:52 AM, Jilles Tjoelker wrote:

On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:

Yes, this isn't at all safe.  There's no guarantee whatsoever that
the offset on the directory fd that isn't something returned by
getdirentries has any meaning.  In particular, the size of the
directory entry in a random filesystem might be a different size
than the structure returned by getdirentries (since it converts
things into a FS-independent format).
This might work for UFS by accident, but this is probably why ZFS
doesn't work.
However, this might be properly fixed by the thing that ino64 is
doing where each directory entry returned by getdirentries gives
you a seek offset that you _can_ directly seek to (as opposed to
seeking to the start of the block and then walking forward N
entries until you get an inter-block entry that is the same).

The ino64 branch only reserves space for d_off and does not use it in
any way. This is appropriate since actually using d_off is a major
feature addition.

A proper d_off would still be useful even if UFS's readdir keeps masking
off the offset so a directory read always starts at the beginning of a
512-byte directory block, since this allows more distinct offset values
than safely using getdirentries()'s *basep. With d_off, one outer loop
must read at least one directory block to avoid spinning indefinitely,
while using getdirentries()'s *basep requires reading the whole
getdirentries() buffer.

Some Linux filesystems go further and provide a unique d_off for each
entry.

Another idea would be to store the last d_ino instead of dd_loc into the
struct ddloc. On seekdir(), this would seek to loc_seek as before and
skip entries until that d_ino is found, or to the start of the buffer if
not found (and possibly return some entries again that should not be
returned, but Samba copes with that).


yes.. though maybe a hash of hte inode number and the name may be a 
better thing to search on..


I need to check on your statement about samba to see if its true for 3.6..






___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/25/15 4:28 AM, John Baldwin wrote:

On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote:

On 4/25/15 1:30 AM, Julian Elischer wrote:

On 4/24/15 10:59 PM, John Baldwin wrote:

On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote:

On 4/23/15 9:54 PM, John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as
advertised..

ok so it looks like readdir() (and friends) is totally broken in
the face
of deletes unless you read the entire directory at once or reset
to the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.
For example:

>From
http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html


If a file is removed from or added to the directory after the
most recent call
to opendir() or rewinddir(), whether a subsequent call to
readdir() returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that
you will
get inconsistencies if you scan a directory concurrent with add
and remove.

UFS might kind of work actually since deletes do not compact the
backing
directory, but I suspect NFS and ZFS would not work.  In
addition, our
current NFS support for seekdir is pretty flaky and can't be
fixed without
changes to return the seek offset for each directory entry (I
believe that
the projects/ino64 patches include this since they are breaking
the ABI of
the relevant structures already).  The ABI breakage makes this a
very
non-trivial task.  However, even if you have that per-item
cookie, it is
likely meaningless in the face of filesystems that use any sort
of more
advanced structure than an array (such as trees, etc.) to store
directory
entries.  POSIX specifically mentions this in the rationale for
seekdir:

One of the perceived problems of implementation is that returning
to a given point in a directory is quite difficult to describe
formally, in spite of its intuitive appeal, when systems that use
B-trees, hashing functions, or other similar mechanisms to order
their directories are considered. The definition of seekdir() and
telldir() does not specify whether, when using these interfaces,
a given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming
seekdir could
just change readdir to immediately return EOF until you call
rewinddir.

how does readdir know that the directory has been changed? fstat?

Undefined.  FreeBSD's libc doesn't cache the entire directory
(unless you
are using a union mount), instead it just caches the most recent
call to
getdirentries().  It then serves up entries from that block until
you hit
the end.  When you hit the end it reads the next block, etc. When you
call telldir(), the kernel saves the seek offset from the start of the
current block and the offset within that block and returns a cookie to
you.  seekdir() looks up the cookie to find the (seek offset, block
offset)
tuple.  If the location matches the directory's current location it
doesn't
do anything, otherwise it seeks to the seek offset and reads a new
block via
getdirentries().  There is no check for seeing if a directory is
changed.  Instead, you can only be stale by one "block".  When you
read
a new block it is relative to the directory's state at that time.

Rick's suggestion of reusing the block for a seek within a block
would be
fairly easy to implement, I think it would just be:

Index: head/lib/libc/gen/telldir.c
===
--- head/lib/libc/gen/telldir.c (revision 281929)
+++ head/lib/libc/gen/telldir.c (working copy)
@@ -101,8 +101,10 @@
  return;
  if (lp->loc_loc == dirp->dd_loc && lp->loc_seek ==
dirp->dd_seek)
  return;
-   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
-   dirp->dd_seek = lp->loc_seek;
+   if (lp->loc_seek != dirp->dd_seek) {
+   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
SEEK_SET);
+   dirp->dd_seek = lp->loc_seek;
+   }

yes I did that yesterday but it still fails when you transition
blocks.. (badly).

I also tried bigger blocks.. also fails (eventually)

I did find a way to make it work...  you had to seek back
to the first block you deleted on each set..
then work forward from there again..  unfortunately since
I'm trying to make a microsoft program not fail (via samba)
I have no control over how it does things and seekdir doesn't
know what was deleted anyway... (so the fix is fine for  the
test program but not for real life)

I think I can make the BSD one act like the linux one by changing
the lseek being done to use the offset (loc) plus the buffer seek
address of the target, instead 

Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Rick Macklem
Jilles Tjoelker wrote:
> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> > Yes, this isn't at all safe.  There's no guarantee whatsoever that
> > the offset on the directory fd that isn't something returned by
> > getdirentries has any meaning.  In particular, the size of the
> > directory entry in a random filesystem might be a different size
> > than the structure returned by getdirentries (since it converts
> > things into a FS-independent format).
> 
> > This might work for UFS by accident, but this is probably why ZFS
> > doesn't work.
> 
> > However, this might be properly fixed by the thing that ino64 is
> > doing where each directory entry returned by getdirentries gives
> > you a seek offset that you _can_ directly seek to (as opposed to
> > seeking to the start of the block and then walking forward N
> > entries until you get an inter-block entry that is the same).
> 
> The ino64 branch only reserves space for d_off and does not use it in
> any way. This is appropriate since actually using d_off is a major
> feature addition.
> 
Well, at some point ino64 will need to define a new getdirentries(2)
syscall and I believe this new syscall can have different/additional
arguments.

I'd suggest that the new gtedirentries(2) syscall should return a
flag to indicate that the underlying file system is filling in d_off.
Then the libc functions can use d_off if it it available.
(They will still need to "work" at least as well as they do now if
 the file system doesn't support d_off. The old getdirentries(2) syscall
 will be returning the old/current "struct dirent" which doesn't have
 the field anyhow.)

Another bit of fun is that the argument for seekdir()/telldir() is a
long and ends up 32bits for some arches. d_off is 64bits, since that
is what some file systems require.
Maybe the library code can only use d_off if it is a 64bit arch and
the file system is filling it in. (Or maybe the library can keep track
of 32<->64bit mappings for the offsets. I haven't looked at the libc
functions for a while, so I can't remember what they keep track of.)

rick

> A proper d_off would still be useful even if UFS's readdir keeps
> masking
> off the offset so a directory read always starts at the beginning of
> a
> 512-byte directory block, since this allows more distinct offset
> values
> than safely using getdirentries()'s *basep. With d_off, one outer
> loop
> must read at least one directory block to avoid spinning
> indefinitely,
> while using getdirentries()'s *basep requires reading the whole
> getdirentries() buffer.
> 
> Some Linux filesystems go further and provide a unique d_off for each
> entry.
> 
> Another idea would be to store the last d_ino instead of dd_loc into
> the
> struct ddloc. On seekdir(), this would seek to loc_seek as before and
> skip entries until that d_ino is found, or to the start of the buffer
> if
> not found (and possibly return some entries again that should not be
> returned, but Samba copes with that).
> 
> --
> Jilles Tjoelker
> ___
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscr...@freebsd.org"
> 
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Jilles Tjoelker
On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> Yes, this isn't at all safe.  There's no guarantee whatsoever that
> the offset on the directory fd that isn't something returned by
> getdirentries has any meaning.  In particular, the size of the
> directory entry in a random filesystem might be a different size
> than the structure returned by getdirentries (since it converts
> things into a FS-independent format).

> This might work for UFS by accident, but this is probably why ZFS
> doesn't work.

> However, this might be properly fixed by the thing that ino64 is
> doing where each directory entry returned by getdirentries gives
> you a seek offset that you _can_ directly seek to (as opposed to
> seeking to the start of the block and then walking forward N
> entries until you get an inter-block entry that is the same).

The ino64 branch only reserves space for d_off and does not use it in
any way. This is appropriate since actually using d_off is a major
feature addition.

A proper d_off would still be useful even if UFS's readdir keeps masking
off the offset so a directory read always starts at the beginning of a
512-byte directory block, since this allows more distinct offset values
than safely using getdirentries()'s *basep. With d_off, one outer loop
must read at least one directory block to avoid spinning indefinitely,
while using getdirentries()'s *basep requires reading the whole
getdirentries() buffer.

Some Linux filesystems go further and provide a unique d_off for each
entry.

Another idea would be to store the last d_ino instead of dd_loc into the
struct ddloc. On seekdir(), this would seek to loc_seek as before and
skip entries until that d_ino is found, or to the start of the buffer if
not found (and possibly return some entries again that should not be
returned, but Samba copes with that).

-- 
Jilles Tjoelker
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread John Baldwin
On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote:
> On 4/25/15 1:30 AM, Julian Elischer wrote:
> > On 4/24/15 10:59 PM, John Baldwin wrote:
> >> On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote:
> >>> On 4/23/15 9:54 PM, John Baldwin wrote:
>  On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> > On 4/23/15 11:20 AM, Julian Elischer wrote:
> >> I'm debugging a problem being seen with samba 3.6.
> >>
> >> basically  telldir/seekdir/readdir don't seem to work as 
> >> advertised..
> > ok so it looks like readdir() (and friends) is totally broken in 
> > the face
> > of deletes unless you read the entire directory at once or reset 
> > to the
> > the first file before the deletes, or earlier.
>  I'm not sure that Samba isn't assuming non-portable behavior.  
>  For example:
> 
>  >From 
>  http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html 
> 
> 
>  If a file is removed from or added to the directory after the 
>  most recent call
>  to opendir() or rewinddir(), whether a subsequent call to 
>  readdir() returns an
>  entry for that file is unspecified.
> 
>  While this doesn't speak directly to your case, it does note that 
>  you will
>  get inconsistencies if you scan a directory concurrent with add 
>  and remove.
> 
>  UFS might kind of work actually since deletes do not compact the 
>  backing
>  directory, but I suspect NFS and ZFS would not work.  In 
>  addition, our
>  current NFS support for seekdir is pretty flaky and can't be 
>  fixed without
>  changes to return the seek offset for each directory entry (I 
>  believe that
>  the projects/ino64 patches include this since they are breaking 
>  the ABI of
>  the relevant structures already).  The ABI breakage makes this a 
>  very
>  non-trivial task.  However, even if you have that per-item 
>  cookie, it is
>  likely meaningless in the face of filesystems that use any sort 
>  of more
>  advanced structure than an array (such as trees, etc.) to store 
>  directory
>  entries.  POSIX specifically mentions this in the rationale for 
>  seekdir:
> 
>  One of the perceived problems of implementation is that returning 
>  to a given point in a directory is quite difficult to describe 
>  formally, in spite of its intuitive appeal, when systems that use 
>  B-trees, hashing functions, or other similar mechanisms to order 
>  their directories are considered. The definition of seekdir() and 
>  telldir() does not specify whether, when using these interfaces, 
>  a given directory entry will be seen at all, or more than once.
> 
>  In fact, given that quote, I would argue that what Samba is doing is
>  non-portable.  This would seem to indicate that a conforming 
>  seekdir could
>  just change readdir to immediately return EOF until you call 
>  rewinddir.
> >>> how does readdir know that the directory has been changed? fstat?
> >> Undefined.  FreeBSD's libc doesn't cache the entire directory 
> >> (unless you
> >> are using a union mount), instead it just caches the most recent 
> >> call to
> >> getdirentries().  It then serves up entries from that block until 
> >> you hit
> >> the end.  When you hit the end it reads the next block, etc. When you
> >> call telldir(), the kernel saves the seek offset from the start of the
> >> current block and the offset within that block and returns a cookie to
> >> you.  seekdir() looks up the cookie to find the (seek offset, block 
> >> offset)
> >> tuple.  If the location matches the directory's current location it 
> >> doesn't
> >> do anything, otherwise it seeks to the seek offset and reads a new 
> >> block via
> >> getdirentries().  There is no check for seeing if a directory is
> >> changed.  Instead, you can only be stale by one "block".  When you 
> >> read
> >> a new block it is relative to the directory's state at that time.
> >>
> >> Rick's suggestion of reusing the block for a seek within a block 
> >> would be
> >> fairly easy to implement, I think it would just be:
> >>
> >> Index: head/lib/libc/gen/telldir.c
> >> ===
> >> --- head/lib/libc/gen/telldir.c (revision 281929)
> >> +++ head/lib/libc/gen/telldir.c (working copy)
> >> @@ -101,8 +101,10 @@
> >>  return;
> >>  if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == 
> >> dirp->dd_seek)
> >>  return;
> >> -   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
> >> -   dirp->dd_seek = lp->loc_seek;
> >> +   if (lp->loc_seek != dirp->dd_seek) {
> >> +   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, 
> >> SEEK_SET);
> >> +   dirp->dd_seek = lp->loc_seek;
> >> +   }
> >
> > yes I did that yesterda

Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/25/15 1:30 AM, Julian Elischer wrote:

On 4/24/15 10:59 PM, John Baldwin wrote:

On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote:

On 4/23/15 9:54 PM, John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as 
advertised..
ok so it looks like readdir() (and friends) is totally broken in 
the face
of deletes unless you read the entire directory at once or reset 
to the

the first file before the deletes, or earlier.
I'm not sure that Samba isn't assuming non-portable behavior.  
For example:


>From 
http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html 



If a file is removed from or added to the directory after the 
most recent call
to opendir() or rewinddir(), whether a subsequent call to 
readdir() returns an

entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that 
you will
get inconsistencies if you scan a directory concurrent with add 
and remove.


UFS might kind of work actually since deletes do not compact the 
backing
directory, but I suspect NFS and ZFS would not work.  In 
addition, our
current NFS support for seekdir is pretty flaky and can't be 
fixed without
changes to return the seek offset for each directory entry (I 
believe that
the projects/ino64 patches include this since they are breaking 
the ABI of
the relevant structures already).  The ABI breakage makes this a 
very
non-trivial task.  However, even if you have that per-item 
cookie, it is
likely meaningless in the face of filesystems that use any sort 
of more
advanced structure than an array (such as trees, etc.) to store 
directory
entries.  POSIX specifically mentions this in the rationale for 
seekdir:


One of the perceived problems of implementation is that returning 
to a given point in a directory is quite difficult to describe 
formally, in spite of its intuitive appeal, when systems that use 
B-trees, hashing functions, or other similar mechanisms to order 
their directories are considered. The definition of seekdir() and 
telldir() does not specify whether, when using these interfaces, 
a given directory entry will be seen at all, or more than once.


In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming 
seekdir could
just change readdir to immediately return EOF until you call 
rewinddir.

how does readdir know that the directory has been changed? fstat?
Undefined.  FreeBSD's libc doesn't cache the entire directory 
(unless you
are using a union mount), instead it just caches the most recent 
call to
getdirentries().  It then serves up entries from that block until 
you hit

the end.  When you hit the end it reads the next block, etc. When you
call telldir(), the kernel saves the seek offset from the start of the
current block and the offset within that block and returns a cookie to
you.  seekdir() looks up the cookie to find the (seek offset, block 
offset)
tuple.  If the location matches the directory's current location it 
doesn't
do anything, otherwise it seeks to the seek offset and reads a new 
block via

getdirentries().  There is no check for seeing if a directory is
changed.  Instead, you can only be stale by one "block".  When you 
read

a new block it is relative to the directory's state at that time.

Rick's suggestion of reusing the block for a seek within a block 
would be

fairly easy to implement, I think it would just be:

Index: head/lib/libc/gen/telldir.c
===
--- head/lib/libc/gen/telldir.c (revision 281929)
+++ head/lib/libc/gen/telldir.c (working copy)
@@ -101,8 +101,10 @@
 return;
 if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == 
dirp->dd_seek)

 return;
-   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
-   dirp->dd_seek = lp->loc_seek;
+   if (lp->loc_seek != dirp->dd_seek) {
+   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, 
SEEK_SET);

+   dirp->dd_seek = lp->loc_seek;
+   }


yes I did that yesterday but it still fails when you transition 
blocks.. (badly).


I also tried bigger blocks.. also fails (eventually)

I did find a way to make it work...  you had to seek back
to the first block you deleted on each set..
then work forward from there again..  unfortunately since
I'm trying to make a microsoft program not fail (via samba)
I have no control over how it does things and seekdir doesn't
know what was deleted anyway... (so the fix is fine for  the
test program but not for real life)

I think I can make the BSD one act like the linux one by changing 
the lseek being done to use the offset (loc) plus the buffer seek 
address of the target, instead of just going for the buffer base and 
stepping forwar

Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/24/15 10:59 PM, John Baldwin wrote:

On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote:

On 4/23/15 9:54 PM, John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as advertised..

ok so it looks like readdir() (and friends) is totally broken in the face
of deletes unless you read the entire directory at once or reset to the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For example:

>From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most recent call
to opendir() or rewinddir(), whether a subsequent call to readdir() returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you will
get inconsistencies if you scan a directory concurrent with add and remove.

UFS might kind of work actually since deletes do not compact the backing
directory, but I suspect NFS and ZFS would not work.  In addition, our
current NFS support for seekdir is pretty flaky and can't be fixed without
changes to return the seek offset for each directory entry (I believe that
the projects/ino64 patches include this since they are breaking the ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it is
likely meaningless in the face of filesystems that use any sort of more
advanced structure than an array (such as trees, etc.) to store directory
entries.  POSIX specifically mentions this in the rationale for seekdir:

One of the perceived problems of implementation is that returning to a given 
point in a directory is quite difficult to describe formally, in spite of its 
intuitive appeal, when systems that use B-trees, hashing functions, or other 
similar mechanisms to order their directories are considered. The definition of 
seekdir() and telldir() does not specify whether, when using these interfaces, 
a given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir could
just change readdir to immediately return EOF until you call rewinddir.

how does readdir know that the directory has been changed? fstat?

Undefined.  FreeBSD's libc doesn't cache the entire directory (unless you
are using a union mount), instead it just caches the most recent call to
getdirentries().  It then serves up entries from that block until you hit
the end.  When you hit the end it reads the next block, etc.  When you
call telldir(), the kernel saves the seek offset from the start of the
current block and the offset within that block and returns a cookie to
you.  seekdir() looks up the cookie to find the (seek offset, block offset)
tuple.  If the location matches the directory's current location it doesn't
do anything, otherwise it seeks to the seek offset and reads a new block via
getdirentries().  There is no check for seeing if a directory is
changed.  Instead, you can only be stale by one "block".  When you read
a new block it is relative to the directory's state at that time.

Rick's suggestion of reusing the block for a seek within a block would be
fairly easy to implement, I think it would just be:

Index: head/lib/libc/gen/telldir.c
===
--- head/lib/libc/gen/telldir.c (revision 281929)
+++ head/lib/libc/gen/telldir.c (working copy)
@@ -101,8 +101,10 @@
 return;
 if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek)
 return;
-   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
-   dirp->dd_seek = lp->loc_seek;
+   if (lp->loc_seek != dirp->dd_seek) {
+   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
+   dirp->dd_seek = lp->loc_seek;
+   }


yes I did that yesterday but it still fails when you transition 
blocks.. (badly).


I also tried bigger blocks.. also fails (eventually)

I did find a way to make it work...  you had to seek back
to the first block you deleted on each set..
then work forward from there again..  unfortunately since
I'm trying to make a microsoft program not fail (via samba)
I have no control over how it does things and seekdir doesn't
know what was deleted anyway... (so the fix is fine for  the
test program but not for real life)

I think I can make the BSD one act like the linux one by changing the 
lseek being done to use the offset (loc) plus the buffer seek address 
of the target, instead of just going for the buffer base and stepping 
forward through the entries..


maybe tomorrow.







 dirp->dd_loc = 0;
 wh

Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/24/15 10:43 PM, John Baldwin wrote:

On Friday, April 24, 2015 06:42:01 PM Julian Elischer wrote:

On 4/24/15 6:12 AM, Rick Macklem wrote:

John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as
advertised..

ok so it looks like readdir() (and friends) is totally broken in
the face
of deletes unless you read the entire directory at once or reset to
the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For
example:

From
http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most
recent call
to opendir() or rewinddir(), whether a subsequent call to readdir()
returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you
will
get inconsistencies if you scan a directory concurrent with add and
remove.

UFS might kind of work actually since deletes do not compact the
backing
directory, but I suspect NFS and ZFS would not work.  In addition,
our
current NFS support for seekdir is pretty flaky and can't be fixed
without
changes to return the seek offset for each directory entry (I believe
that
the projects/ino64 patches include this since they are breaking the
ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it
is
likely meaningless in the face of filesystems that use any sort of
more
advanced structure than an array (such as trees, etc.) to store
directory
entries.  POSIX specifically mentions this in the rationale for
seekdir:

http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

One of the perceived problems of implementation is that returning to
a given point in a directory is quite difficult to describe
formally, in spite of its intuitive appeal, when systems that use
B-trees, hashing functions, or other similar mechanisms to order
their directories are considered. The definition of seekdir() and
telldir() does not specify whether, when using these interfaces, a
given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir
could
just change readdir to immediately return EOF until you call
rewinddir.


Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked,
but I strongly suspect that it reads the entire directory upon either opendir()
or the first readdir().

Oh, and I hate to say it, but I suspect Linux defines the "standard" on
this and not POSIX. (In other words, if it works on Linux, it isn't broken;-)

rick

here's an interesting datapoint.  If the test program is run on
kFreeBSD using glibc, it runs without flaw.

OS-X  (bsd derived libc) HFS+ fails
FreeBSD libc (UFS) fails
FreeBSD libc (ZFS) fails
FreeBSD glibc  succceeds
Centos 6.5 glibc succeeds

some NFS tests would be nice to do too I guess...
glibc authors seem to have done something right.. it even copes with
FreeBSD kernel..

It's probably just reading the entire directory and caching it until
rewinddir is called.  FreeBSD's libc does this if you have a unionfs
mount.  It would be a trivial change to always do this, is just means
you will never notice any concurrent deletes, adds, or renames until
you call rewinddir again.  At that point you might as well have the
client just do rewinddir each time.  You are just moving the caching that
Samba should be doing to be portable from samba into libc.  I'm not sure
that's really an improvement so much as shuffling deck chairs.

Also, that is going to keep giving you directory entries for the files
you've already removed (unless you patch libc to explicitly hack around
that case by stating each entry and skipping ones that fail with ENOENT
which would add a lot of overhead).

SO I rewrote/ported glibc telldir/readdir to our FreeBSD..

Firstly, a slight addition.. BSD libc also  fails on tmpfs
( I found that out by  accident.. I thought I was on UFS and forgot I 
had a tmpfs there)


ported glibc readdir/friends works an all three.
and it is not caching the whole thing..

I'm doing the deletes in batches of 5.
and every 5 deletes I see it doing:
 95985 testit2  RET   write 30/0x1e
 95985 testit2  CALL  write(0x1,0x801007000,0x18)
 95985 testit2  GIO   fd 1 wrote 24 bytes
   "file file-1756 returned
   "
 95985 testit2  RET   write 24/0x18
 95985 testit2  CALL  write(0x1,0x801007000,0x1d)
 95985 testit2  GIO   fd 1 wrote 29 bytes
   "Seeking back to location 144
   "
 95985 testit2  RET   write 29/0x1d
 95985 testit2  CALL  lseek(0x3,0x90,SEEK_SET)
 95985 testit2  RET   lseek 144/0x90
 95985 testit2  CALL  write(0x1,0x801007000,0x1e)
 95985 t

Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/24/15 6:48 PM, Garrett Cooper wrote:

On Apr 24, 2015, at 3:42, Julian Elischer  wrote:


here's an interesting datapoint.  If the test program is run on kFreeBSD using 
glibc, it runs without flaw.

OS-X  (bsd derived libc) HFS+ fails
FreeBSD libc (UFS) fails
FreeBSD libc (ZFS) fails
FreeBSD glibc  succceeds
Centos 6.5 glibc succeeds

some NFS tests would be nice to do too I guess...
glibc authors seem to have done something right.. it even copes with FreeBSD 
kernel..

Hi Julian,
What version of FreeBSD are you running (uname -a) / coming up with the 
above results on? I’m asking because IIRC there was an issue that pho@ brought 
up with readdir/telldir/etc that was fixed on CURRENT in the past 1-2 years for 
us, that might be affecting you.
Cheers,
-NGie

11-current as of a few weeks ago


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread John Baldwin
On Friday, April 24, 2015 06:42:01 PM Julian Elischer wrote:
> On 4/24/15 6:12 AM, Rick Macklem wrote:
> > John Baldwin wrote:
> >> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> >>> On 4/23/15 11:20 AM, Julian Elischer wrote:
>  I'm debugging a problem being seen with samba 3.6.
> 
>  basically  telldir/seekdir/readdir don't seem to work as
>  advertised..
> >>> ok so it looks like readdir() (and friends) is totally broken in
> >>> the face
> >>> of deletes unless you read the entire directory at once or reset to
> >>> the
> >>> the first file before the deletes, or earlier.
> >> I'm not sure that Samba isn't assuming non-portable behavior.  For
> >> example:
> >>
> >> From
> >> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html
> >>
> >> If a file is removed from or added to the directory after the most
> >> recent call
> >> to opendir() or rewinddir(), whether a subsequent call to readdir()
> >> returns an
> >> entry for that file is unspecified.
> >>
> >> While this doesn't speak directly to your case, it does note that you
> >> will
> >> get inconsistencies if you scan a directory concurrent with add and
> >> remove.
> >>
> >> UFS might kind of work actually since deletes do not compact the
> >> backing
> >> directory, but I suspect NFS and ZFS would not work.  In addition,
> >> our
> >> current NFS support for seekdir is pretty flaky and can't be fixed
> >> without
> >> changes to return the seek offset for each directory entry (I believe
> >> that
> >> the projects/ino64 patches include this since they are breaking the
> >> ABI of
> >> the relevant structures already).  The ABI breakage makes this a very
> >> non-trivial task.  However, even if you have that per-item cookie, it
> >> is
> >> likely meaningless in the face of filesystems that use any sort of
> >> more
> >> advanced structure than an array (such as trees, etc.) to store
> >> directory
> >> entries.  POSIX specifically mentions this in the rationale for
> >> seekdir:
> >>
> >> http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
> >>
> >> One of the perceived problems of implementation is that returning to
> >> a given point in a directory is quite difficult to describe
> >> formally, in spite of its intuitive appeal, when systems that use
> >> B-trees, hashing functions, or other similar mechanisms to order
> >> their directories are considered. The definition of seekdir() and
> >> telldir() does not specify whether, when using these interfaces, a
> >> given directory entry will be seen at all, or more than once.
> >>
> >> In fact, given that quote, I would argue that what Samba is doing is
> >> non-portable.  This would seem to indicate that a conforming seekdir
> >> could
> >> just change readdir to immediately return EOF until you call
> >> rewinddir.
> >>
> > Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked,
> > but I strongly suspect that it reads the entire directory upon either 
> > opendir()
> > or the first readdir().
> >
> > Oh, and I hate to say it, but I suspect Linux defines the "standard" on
> > this and not POSIX. (In other words, if it works on Linux, it isn't 
> > broken;-)
> >
> > rick
> 
> here's an interesting datapoint.  If the test program is run on 
> kFreeBSD using glibc, it runs without flaw.
> 
> OS-X  (bsd derived libc) HFS+ fails
> FreeBSD libc (UFS) fails
> FreeBSD libc (ZFS) fails
> FreeBSD glibc  succceeds
> Centos 6.5 glibc succeeds
> 
> some NFS tests would be nice to do too I guess...
> glibc authors seem to have done something right.. it even copes with 
> FreeBSD kernel..

It's probably just reading the entire directory and caching it until
rewinddir is called.  FreeBSD's libc does this if you have a unionfs
mount.  It would be a trivial change to always do this, is just means
you will never notice any concurrent deletes, adds, or renames until
you call rewinddir again.  At that point you might as well have the
client just do rewinddir each time.  You are just moving the caching that
Samba should be doing to be portable from samba into libc.  I'm not sure
that's really an improvement so much as shuffling deck chairs.

Also, that is going to keep giving you directory entries for the files
you've already removed (unless you patch libc to explicitly hack around
that case by stating each entry and skipping ones that fail with ENOENT
which would add a lot of overhead).

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread John Baldwin
On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote:
> On 4/23/15 9:54 PM, John Baldwin wrote:
> > On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> >> On 4/23/15 11:20 AM, Julian Elischer wrote:
> >>> I'm debugging a problem being seen with samba 3.6.
> >>>
> >>> basically  telldir/seekdir/readdir don't seem to work as advertised..
> >> ok so it looks like readdir() (and friends) is totally broken in the face
> >> of deletes unless you read the entire directory at once or reset to the
> >> the first file before the deletes, or earlier.
> > I'm not sure that Samba isn't assuming non-portable behavior.  For example:
> >
> > >From 
> > >http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html
> >
> > If a file is removed from or added to the directory after the most recent 
> > call
> > to opendir() or rewinddir(), whether a subsequent call to readdir() returns 
> > an
> > entry for that file is unspecified.
> >
> > While this doesn't speak directly to your case, it does note that you will
> > get inconsistencies if you scan a directory concurrent with add and remove.
> >
> > UFS might kind of work actually since deletes do not compact the backing
> > directory, but I suspect NFS and ZFS would not work.  In addition, our
> > current NFS support for seekdir is pretty flaky and can't be fixed without
> > changes to return the seek offset for each directory entry (I believe that
> > the projects/ino64 patches include this since they are breaking the ABI of
> > the relevant structures already).  The ABI breakage makes this a very
> > non-trivial task.  However, even if you have that per-item cookie, it is
> > likely meaningless in the face of filesystems that use any sort of more
> > advanced structure than an array (such as trees, etc.) to store directory
> > entries.  POSIX specifically mentions this in the rationale for seekdir:
> >
> 
> >
> > One of the perceived problems of implementation is that returning to a 
> > given point in a directory is quite difficult to describe formally, in 
> > spite of its intuitive appeal, when systems that use B-trees, hashing 
> > functions, or other similar mechanisms to order their directories are 
> > considered. The definition of seekdir() and telldir() does not specify 
> > whether, when using these interfaces, a given directory entry will be seen 
> > at all, or more than once.
> >
> > In fact, given that quote, I would argue that what Samba is doing is
> > non-portable.  This would seem to indicate that a conforming seekdir could
> > just change readdir to immediately return EOF until you call rewinddir.
> 
> how does readdir know that the directory has been changed? fstat?

Undefined.  FreeBSD's libc doesn't cache the entire directory (unless you
are using a union mount), instead it just caches the most recent call to
getdirentries().  It then serves up entries from that block until you hit
the end.  When you hit the end it reads the next block, etc.  When you
call telldir(), the kernel saves the seek offset from the start of the
current block and the offset within that block and returns a cookie to
you.  seekdir() looks up the cookie to find the (seek offset, block offset)
tuple.  If the location matches the directory's current location it doesn't
do anything, otherwise it seeks to the seek offset and reads a new block via
getdirentries().  There is no check for seeing if a directory is
changed.  Instead, you can only be stale by one "block".  When you read
a new block it is relative to the directory's state at that time.

Rick's suggestion of reusing the block for a seek within a block would be
fairly easy to implement, I think it would just be:

Index: head/lib/libc/gen/telldir.c
===
--- head/lib/libc/gen/telldir.c (revision 281929)
+++ head/lib/libc/gen/telldir.c (working copy)
@@ -101,8 +101,10 @@
return;
if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek)
return;
-   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
-   dirp->dd_seek = lp->loc_seek;
+   if (lp->loc_seek != dirp->dd_seek) {
+   (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
+   dirp->dd_seek = lp->loc_seek;
+   }
dirp->dd_loc = 0;
while (dirp->dd_loc < lp->loc_loc) {
dp = _readdir_unlocked(dirp, 0);


(You still want to re-parse the block I think since the telldir offset
might be from an earlier version of the block and not point to a valid
directory entry since they are variable sized.)

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Garrett Cooper
On Apr 24, 2015, at 3:42, Julian Elischer  wrote:

> here's an interesting datapoint.  If the test program is run on kFreeBSD 
> using glibc, it runs without flaw.
> 
> OS-X  (bsd derived libc) HFS+ fails
> FreeBSD libc (UFS) fails
> FreeBSD libc (ZFS) fails
> FreeBSD glibc  succceeds
> Centos 6.5 glibc succeeds
> 
> some NFS tests would be nice to do too I guess...
> glibc authors seem to have done something right.. it even copes with FreeBSD 
> kernel..

Hi Julian,
What version of FreeBSD are you running (uname -a) / coming up with the 
above results on? I’m asking because IIRC there was an issue that pho@ brought 
up with readdir/telldir/etc that was fixed on CURRENT in the past 1-2 years for 
us, that might be affecting you.
Cheers,
-NGie


signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: readdir/telldir/seekdir problem (i think)

2015-04-24 Thread Julian Elischer

On 4/24/15 6:12 AM, Rick Macklem wrote:

John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as
advertised..

ok so it looks like readdir() (and friends) is totally broken in
the face
of deletes unless you read the entire directory at once or reset to
the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For
example:

From
http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most
recent call
to opendir() or rewinddir(), whether a subsequent call to readdir()
returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you
will
get inconsistencies if you scan a directory concurrent with add and
remove.

UFS might kind of work actually since deletes do not compact the
backing
directory, but I suspect NFS and ZFS would not work.  In addition,
our
current NFS support for seekdir is pretty flaky and can't be fixed
without
changes to return the seek offset for each directory entry (I believe
that
the projects/ino64 patches include this since they are breaking the
ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it
is
likely meaningless in the face of filesystems that use any sort of
more
advanced structure than an array (such as trees, etc.) to store
directory
entries.  POSIX specifically mentions this in the rationale for
seekdir:

http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

One of the perceived problems of implementation is that returning to
a given point in a directory is quite difficult to describe
formally, in spite of its intuitive appeal, when systems that use
B-trees, hashing functions, or other similar mechanisms to order
their directories are considered. The definition of seekdir() and
telldir() does not specify whether, when using these interfaces, a
given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir
could
just change readdir to immediately return EOF until you call
rewinddir.


Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked,
but I strongly suspect that it reads the entire directory upon either opendir()
or the first readdir().

Oh, and I hate to say it, but I suspect Linux defines the "standard" on
this and not POSIX. (In other words, if it works on Linux, it isn't broken;-)

rick


here's an interesting datapoint.  If the test program is run on 
kFreeBSD using glibc, it runs without flaw.


OS-X  (bsd derived libc) HFS+ fails
FreeBSD libc (UFS) fails
FreeBSD libc (ZFS) fails
FreeBSD glibc  succceeds
Centos 6.5 glibc succeeds

some NFS tests would be nice to do too I guess...
glibc authors seem to have done something right.. it even copes with 
FreeBSD kernel..







--
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to
"freebsd-current-unsubscr...@freebsd.org"


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"




___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread Julian Elischer

On 4/24/15 5:50 AM, Rick Macklem wrote:

John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as
advertised..

ok so it looks like readdir() (and friends) is totally broken in
the face
of deletes unless you read the entire directory at once or reset to
the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For
example:

From
http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most
recent call
to opendir() or rewinddir(), whether a subsequent call to readdir()
returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you
will
get inconsistencies if you scan a directory concurrent with add and
remove.

UFS might kind of work actually since deletes do not compact the
backing
directory, but I suspect NFS and ZFS would not work.  In addition,
our
current NFS support for seekdir is pretty flaky and can't be fixed
without
changes to return the seek offset for each directory entry (I believe
that
the projects/ino64 patches include this since they are breaking the
ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it
is
likely meaningless in the face of filesystems that use any sort of
more
advanced structure than an array (such as trees, etc.) to store
directory
entries.  POSIX specifically mentions this in the rationale for
seekdir:

http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

One of the perceived problems of implementation is that returning to
a given point in a directory is quite difficult to describe
formally, in spite of its intuitive appeal, when systems that use
B-trees, hashing functions, or other similar mechanisms to order
their directories are considered. The definition of seekdir() and
telldir() does not specify whether, when using these interfaces, a
given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir
could
just change readdir to immediately return EOF until you call
rewinddir.


Loosely related to this, I have been tempted to modify readdir() to
read the entire directory on the first readdir() for NFS, to avoid the
readdir()/unlink() problem.


I did find a bug in our readdir/seekdir that makes it a lot worse...
We reload the kernel's idea of the directory every time we do
a seekdir() back, even if it's within the same block,
which makes us a lot more susceptible to the problem..
making it not do that unless the new location is in another block made
it work on directories with up to several thousand files (with 32k 
blocksize)

before failing.
With that bug it's possible do make every seekdir() produce failures
even in a directory of just 3 files..  The downside is that the client
continues to see the old contents of the block even though he has done 
a seekdir()

within it.



My concern was doing this for a very large directory. Maybe it could be
done for directories up to some size?

rick


--
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to
"freebsd-current-unsubscr...@freebsd.org"


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"




___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread Julian Elischer

On 4/23/15 9:54 PM, John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as advertised..

ok so it looks like readdir() (and friends) is totally broken in the face
of deletes unless you read the entire directory at once or reset to the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For example:

>From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most recent call
to opendir() or rewinddir(), whether a subsequent call to readdir() returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you will
get inconsistencies if you scan a directory concurrent with add and remove.

UFS might kind of work actually since deletes do not compact the backing
directory, but I suspect NFS and ZFS would not work.  In addition, our
current NFS support for seekdir is pretty flaky and can't be fixed without
changes to return the seek offset for each directory entry (I believe that
the projects/ino64 patches include this since they are breaking the ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it is
likely meaningless in the face of filesystems that use any sort of more
advanced structure than an array (such as trees, etc.) to store directory
entries.  POSIX specifically mentions this in the rationale for seekdir:





One of the perceived problems of implementation is that returning to a given 
point in a directory is quite difficult to describe formally, in spite of its 
intuitive appeal, when systems that use B-trees, hashing functions, or other 
similar mechanisms to order their directories are considered. The definition of 
seekdir() and telldir() does not specify whether, when using these interfaces, 
a given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir could
just change readdir to immediately return EOF until you call rewinddir.


how does readdir know that the directory has been changed? fstat?





___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread Julian Elischer

On 4/24/15 6:12 AM, Rick Macklem wrote:

John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as
advertised..

ok so it looks like readdir() (and friends) is totally broken in
the face
of deletes unless you read the entire directory at once or reset to
the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For
example:

From
http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most
recent call
to opendir() or rewinddir(), whether a subsequent call to readdir()
returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you
will
get inconsistencies if you scan a directory concurrent with add and
remove.

UFS might kind of work actually since deletes do not compact the
backing
directory, but I suspect NFS and ZFS would not work.  In addition,
our
current NFS support for seekdir is pretty flaky and can't be fixed
without
changes to return the seek offset for each directory entry (I believe
that
the projects/ino64 patches include this since they are breaking the
ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it
is
likely meaningless in the face of filesystems that use any sort of
more
advanced structure than an array (such as trees, etc.) to store
directory
entries.  POSIX specifically mentions this in the rationale for
seekdir:

http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

One of the perceived problems of implementation is that returning to
a given point in a directory is quite difficult to describe
formally, in spite of its intuitive appeal, when systems that use
B-trees, hashing functions, or other similar mechanisms to order
their directories are considered. The definition of seekdir() and
telldir() does not specify whether, when using these interfaces, a
given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir
could
just change readdir to immediately return EOF until you call
rewinddir.


Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked,
but I strongly suspect that it reads the entire directory upon either opendir()
or the first readdir().

they make it work for everything apparently.. which is quite a trick.

Oh, and I hate to say it, but I suspect Linux defines the "standard" on
this and not POSIX. (In other words, if it works on Linux, it isn't broken;-)

rick


--
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to
"freebsd-current-unsubscr...@freebsd.org"


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"




___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread Rick Macklem
John Baldwin wrote:
> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> > On 4/23/15 11:20 AM, Julian Elischer wrote:
> > > I'm debugging a problem being seen with samba 3.6.
> > >
> > > basically  telldir/seekdir/readdir don't seem to work as
> > > advertised..
> > 
> > ok so it looks like readdir() (and friends) is totally broken in
> > the face
> > of deletes unless you read the entire directory at once or reset to
> > the
> > the first file before the deletes, or earlier.
> 
> I'm not sure that Samba isn't assuming non-portable behavior.  For
> example:
> 
> From
> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html
> 
> If a file is removed from or added to the directory after the most
> recent call
> to opendir() or rewinddir(), whether a subsequent call to readdir()
> returns an
> entry for that file is unspecified.
> 
> While this doesn't speak directly to your case, it does note that you
> will
> get inconsistencies if you scan a directory concurrent with add and
> remove.
> 
> UFS might kind of work actually since deletes do not compact the
> backing
> directory, but I suspect NFS and ZFS would not work.  In addition,
> our
> current NFS support for seekdir is pretty flaky and can't be fixed
> without
> changes to return the seek offset for each directory entry (I believe
> that
> the projects/ino64 patches include this since they are breaking the
> ABI of
> the relevant structures already).  The ABI breakage makes this a very
> non-trivial task.  However, even if you have that per-item cookie, it
> is
> likely meaningless in the face of filesystems that use any sort of
> more
> advanced structure than an array (such as trees, etc.) to store
> directory
> entries.  POSIX specifically mentions this in the rationale for
> seekdir:
> 
> http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
> 
> One of the perceived problems of implementation is that returning to
> a given point in a directory is quite difficult to describe
> formally, in spite of its intuitive appeal, when systems that use
> B-trees, hashing functions, or other similar mechanisms to order
> their directories are considered. The definition of seekdir() and
> telldir() does not specify whether, when using these interfaces, a
> given directory entry will be seen at all, or more than once.
> 
> In fact, given that quote, I would argue that what Samba is doing is
> non-portable.  This would seem to indicate that a conforming seekdir
> could
> just change readdir to immediately return EOF until you call
> rewinddir.
> 
Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked,
but I strongly suspect that it reads the entire directory upon either opendir()
or the first readdir().

Oh, and I hate to say it, but I suspect Linux defines the "standard" on
this and not POSIX. (In other words, if it works on Linux, it isn't broken;-)

rick

> --
> John Baldwin
> ___
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscr...@freebsd.org"
> 
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread Rick Macklem
John Baldwin wrote:
> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> > On 4/23/15 11:20 AM, Julian Elischer wrote:
> > > I'm debugging a problem being seen with samba 3.6.
> > >
> > > basically  telldir/seekdir/readdir don't seem to work as
> > > advertised..
> > 
> > ok so it looks like readdir() (and friends) is totally broken in
> > the face
> > of deletes unless you read the entire directory at once or reset to
> > the
> > the first file before the deletes, or earlier.
> 
> I'm not sure that Samba isn't assuming non-portable behavior.  For
> example:
> 
> From
> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html
> 
> If a file is removed from or added to the directory after the most
> recent call
> to opendir() or rewinddir(), whether a subsequent call to readdir()
> returns an
> entry for that file is unspecified.
> 
> While this doesn't speak directly to your case, it does note that you
> will
> get inconsistencies if you scan a directory concurrent with add and
> remove.
> 
> UFS might kind of work actually since deletes do not compact the
> backing
> directory, but I suspect NFS and ZFS would not work.  In addition,
> our
> current NFS support for seekdir is pretty flaky and can't be fixed
> without
> changes to return the seek offset for each directory entry (I believe
> that
> the projects/ino64 patches include this since they are breaking the
> ABI of
> the relevant structures already).  The ABI breakage makes this a very
> non-trivial task.  However, even if you have that per-item cookie, it
> is
> likely meaningless in the face of filesystems that use any sort of
> more
> advanced structure than an array (such as trees, etc.) to store
> directory
> entries.  POSIX specifically mentions this in the rationale for
> seekdir:
> 
> http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
> 
> One of the perceived problems of implementation is that returning to
> a given point in a directory is quite difficult to describe
> formally, in spite of its intuitive appeal, when systems that use
> B-trees, hashing functions, or other similar mechanisms to order
> their directories are considered. The definition of seekdir() and
> telldir() does not specify whether, when using these interfaces, a
> given directory entry will be seen at all, or more than once.
> 
> In fact, given that quote, I would argue that what Samba is doing is
> non-portable.  This would seem to indicate that a conforming seekdir
> could
> just change readdir to immediately return EOF until you call
> rewinddir.
> 
Loosely related to this, I have been tempted to modify readdir() to
read the entire directory on the first readdir() for NFS, to avoid the
readdir()/unlink() problem.

My concern was doing this for a very large directory. Maybe it could be
done for directories up to some size?

rick

> --
> John Baldwin
> ___
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscr...@freebsd.org"
> 
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread Julian Elischer

On 4/23/15 9:54 PM, John Baldwin wrote:

On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as advertised..

ok so it looks like readdir() (and friends) is totally broken in the face
of deletes unless you read the entire directory at once or reset to the
the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For example:

>From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most recent call
to opendir() or rewinddir(), whether a subsequent call to readdir() returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you will
get inconsistencies if you scan a directory concurrent with add and remove.

UFS might kind of work actually since deletes do not compact the backing
directory, but I suspect NFS and ZFS would not work.


looking in ufs_getdirent (or whatever it's called) even UFS is packing
the list of files and removing deleted entries.
it's also what I'm seeing in practice.
if I do a seekdir back to the location of the first deleted file, I 
find a new file in that slot,

that has been shuffled down.


  In addition, our
current NFS support for seekdir is pretty flaky and can't be fixed without
changes to return the seek offset for each directory entry (I believe that
the projects/ino64 patches include this since they are breaking the ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it is
likely meaningless in the face of filesystems that use any sort of more
advanced structure than an array (such as trees, etc.) to store directory
entries.  POSIX specifically mentions this in the rationale for seekdir:

http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

One of the perceived problems of implementation is that returning to a given 
point in a directory is quite difficult to describe formally, in spite of its 
intuitive appeal, when systems that use B-trees, hashing functions, or other 
similar mechanisms to order their directories are considered. The definition of 
seekdir() and telldir() does not specify whether, when using these interfaces, 
a given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir could
just change readdir to immediately return EOF until you call rewinddir.



___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread John Baldwin
On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> On 4/23/15 11:20 AM, Julian Elischer wrote:
> > I'm debugging a problem being seen with samba 3.6.
> >
> > basically  telldir/seekdir/readdir don't seem to work as advertised..
> 
> ok so it looks like readdir() (and friends) is totally broken in the face
> of deletes unless you read the entire directory at once or reset to the
> the first file before the deletes, or earlier.

I'm not sure that Samba isn't assuming non-portable behavior.  For example:

>From http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html

If a file is removed from or added to the directory after the most recent call
to opendir() or rewinddir(), whether a subsequent call to readdir() returns an
entry for that file is unspecified.

While this doesn't speak directly to your case, it does note that you will
get inconsistencies if you scan a directory concurrent with add and remove.

UFS might kind of work actually since deletes do not compact the backing
directory, but I suspect NFS and ZFS would not work.  In addition, our
current NFS support for seekdir is pretty flaky and can't be fixed without
changes to return the seek offset for each directory entry (I believe that
the projects/ino64 patches include this since they are breaking the ABI of
the relevant structures already).  The ABI breakage makes this a very
non-trivial task.  However, even if you have that per-item cookie, it is
likely meaningless in the face of filesystems that use any sort of more
advanced structure than an array (such as trees, etc.) to store directory
entries.  POSIX specifically mentions this in the rationale for seekdir:

http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

One of the perceived problems of implementation is that returning to a given 
point in a directory is quite difficult to describe formally, in spite of its 
intuitive appeal, when systems that use B-trees, hashing functions, or other 
similar mechanisms to order their directories are considered. The definition of 
seekdir() and telldir() does not specify whether, when using these interfaces, 
a given directory entry will be seen at all, or more than once.

In fact, given that quote, I would argue that what Samba is doing is
non-portable.  This would seem to indicate that a conforming seekdir could
just change readdir to immediately return EOF until you call rewinddir.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: readdir/telldir/seekdir problem (i think)

2015-04-23 Thread Julian Elischer

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as advertised..


ok so it looks like readdir() (and friends) is totally broken in the face
of deletes unless you read the entire directory at once or reset to the
the first file before the deletes, or earlier.
So if you  use readdir() to enumerate files in order to delete them,
and you do it in sections in order to not have to load a potentially
huge list of files all at once, you can not do it unless you rewind to
0 after each set of unlinks. (or you do all the unlinks in one hit).

this appears to be because getdirentries (or maybe the filesystem behind)
will 'compact' the returned data.
At least this appears to be the case looking from the outside.
Next step is to look at the getdirentries code..





here's a little test program

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define CHUNKSIZE 5
#define TOTALFILES 40

static void
SeekDir(DIR *dirp, long loc)
{
printf("Seeking back to location %ld\n", loc);
seekdir(dirp, loc);
}

static long
TellDir(DIR *dirp)
{
long loc;

loc = telldir(dirp);
printf("telldir assigned location %ld\n", loc);
return (loc);
}

int
main(int argc, char *argv[])
{
DIR*dirp;
inti;
intj;
longoffset = 0, prev_offset = 0;
char   *files[100];
charfilename[100];
intfd;
struct dirent  *dp = NULL;

if (chdir("./test2") != 0) {
err(EX_OSERR, "chdir");
}

/*/
/* Put a set of sample files in the target directory */
/*/

for (i=1; i < TOTALFILES ; i++)
{
sprintf(filename, "file-%d", i);
fd = open(filename, O_CREAT, 0666);
if (fd == -1) {
err(EX_OSERR, "open");
}
close(fd);
}
dirp = opendir(".");
offset = TellDir(dirp);
for (i = 0; i < 20; i++)
files[i] = malloc(20);

/***/
/* enumerate and delete small sets of files, one group */
/* at a time.  */
/***/
do {

/*/
/* Read in up to CHUNKSIZE file names*/
/* i will be the number of files we hold */
/*/
for (i = 0; i < CHUNKSIZE; i++) {
if ((dp = readdir(dirp)) != NULL) {
strcpy(files[i], dp->d_name);

printf("readdir (%ld) returned file %s\n",
offset, files[i]);

prev_offset = offset;
offset = TellDir(dirp);

} else {
printf("readdir returned null\n");
break;
}
}


//
/* Simuate the last entry not fitting into our (samba's) 
buffer */
/* If we read someting in on the last slot, push it 
back*/
/* Pretend it didn't fit. This is approximately what SAMBA 
does.*/

//
if (dp != NULL) {
/* Step back */
SeekDir(dirp, prev_offset);
offset = TellDir(dirp);
i--;
printf("file %s returned\n", files[i]);
}

/*/
/* i is the number of names we have left.*/
/*  Delete them. */
/*/
for (j = 0; j < i; j++) {
if (*files[j] == '.') {
printf ("skipping %s\n", files[j]);
} else {
printf("Unlinking file %s\n", files[j]);
if (unlink(files[j]) != 0) {
err(EX_OSERR, "unlink");
}
}
}
} while (dp != NULL);

closedir(dirp);
//chdir("..");

}

The program is simulating what Samba does when fails. (doing a 
recursive delete of a directory)
What it does is reads a chunk of names using readdir() until it's 
(small) buffer s full,
then it uses seekdir() to seek back before the last entry it read, 
(which fails to fit),

 theortically leaving it for the next read.
It then deletes the entries it found and repeats the cycle.

Eventually it should have found all the files in the directory and 
deleted them.

Except that it doesn't.

What actually happens is that some files are not enumerated, even 
though

the seekdir() should have made the readdir() find them.
for added fun. the FIRST seekdir appears to work. but all subsequent 
ones don't.


It behaves this way in -current , all the way back to at

Re: readdir/telldir/seekdir problem (i think)

2015-04-22 Thread Julian Elischer

On 4/23/15 11:20 AM, Julian Elischer wrote:

I'm debugging a problem being seen with samba 3.6.

basically  telldir/seekdir/readdir don't seem to work as advertised..
On the intergooglewebs I see multiple previous reports of similar 
problems, some going back 30 years, but

we supposedly have the fixes for those in -current.

added data point.. if  I comment out the 'unlink' lines all the files 
are enumerated,
so it's something similar to the previously reported unlink-vs-readdir 
bugs. but different.
If I insert a rewinddir() after each set of deletes it all works right 
but that means the

system is continuously rereading all the prior directory blocks..

It goes wrong on at least UFS and ZFS. maybe others.
I'm not totally sure it can be fixed. If teh underlying directory 
blocs are

reshuffled or changed and things are moved, it may just be impossible
to keep any information as to where you are in the directory without
adding more state.



here's a little test program

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define CHUNKSIZE 5
#define TOTALFILES 40

static void
SeekDir(DIR *dirp, long loc)
{
printf("Seeking back to location %ld\n", loc);
seekdir(dirp, loc);
}

static long
TellDir(DIR *dirp)
{
long loc;

loc = telldir(dirp);
printf("telldir assigned location %ld\n", loc);
return (loc);
}

int
main(int argc, char *argv[])
{
DIR*dirp;
inti;
intj;
longoffset = 0, prev_offset = 0;
char   *files[100];
charfilename[100];
intfd;
struct dirent  *dp = NULL;

if (chdir("./test2") != 0) {
err(EX_OSERR, "chdir");
}

/*/
/* Put a set of sample files in the target directory */
/*/

for (i=1; i < TOTALFILES ; i++)
{
sprintf(filename, "file-%d", i);
fd = open(filename, O_CREAT, 0666);
if (fd == -1) {
err(EX_OSERR, "open");
}
close(fd);
}
dirp = opendir(".");
offset = TellDir(dirp);
for (i = 0; i < 20; i++)
files[i] = malloc(20);

/***/
/* enumerate and delete small sets of files, one group */
/* at a time.  */
/***/
do {

/*/
/* Read in up to CHUNKSIZE file names*/
/* i will be the number of files we hold */
/*/
for (i = 0; i < CHUNKSIZE; i++) {
if ((dp = readdir(dirp)) != NULL) {
strcpy(files[i], dp->d_name);

printf("readdir (%ld) returned file %s\n",
offset, files[i]);

prev_offset = offset;
offset = TellDir(dirp);

} else {
printf("readdir returned null\n");
break;
}
}


//
/* Simuate the last entry not fitting into our (samba's) 
buffer */
/* If we read someting in on the last slot, push it 
back*/
/* Pretend it didn't fit. This is approximately what SAMBA 
does.*/

//
if (dp != NULL) {
/* Step back */
SeekDir(dirp, prev_offset);
offset = TellDir(dirp);
i--;
printf("file %s returned\n", files[i]);
}

/*/
/* i is the number of names we have left.*/
/*  Delete them. */
/*/
for (j = 0; j < i; j++) {
if (*files[j] == '.') {
printf ("skipping %s\n", files[j]);
} else {
printf("Unlinking file %s\n", files[j]);
if (unlink(files[j]) != 0) {
err(EX_OSERR, "unlink");
}
}
}
} while (dp != NULL);

closedir(dirp);
//chdir("..");

}

The program is simulating what Samba does when fails. (doing a 
recursive delete of a directory)
What it does is reads a chunk of names using readdir() until it's 
(small) buffer s full,
then it uses seekdir() to seek back before the last entry it read, 
(which fails to fit),

 theortically leaving it for the next read.
It then deletes the entries it found and repeats the cycle.

Eventually it should have found all the files in the directory and 
deleted them.

Except that it doesn't.

What actually happens is that some files are not enumerated, even 
though

the seekdir() should have made the readdir() find them.
for added fun. the FIRST seekdir appears to wor