Re: [RFC v2 03/83] Add super.h.

2018-03-16 Thread Arnd Bergmann
On Fri, Mar 16, 2018 at 3:59 AM, Theodore Y. Ts'o  wrote:
> On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
>>
>> You could also have a resolution of less than a nanosecond. Note
>> that today, the file time stamps generated by the kernel are in
>> jiffies resolution, so at best one millisecond. However, most modern
>> file systems go with the 64+32 bit timestamps because it's not all
>> that expensive.
>
> It actually depends on the architecture and the accuracy/granularity
> of the timekeeping hardware available to the system, but it's possible
> for the granularity of file time stamps to be up to one nanosecond.
> So you can get results like this:
>
> % stat unix_io.o
>   File: unix_io.o
>   Size: 55000   Blocks: 112   IO Block: 4096   regular file
> Device: fc01h/64513d   Inode: 19931278   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
> Access: 2018-03-15 18:09:21.679914182 -0400
> Modify: 2018-03-15 18:09:21.639914089 -0400
> Change: 2018-03-15 18:09:21.639914089 -0400

Note how the nanoseconds only differ in digits 2, 7, and 9 though:

The atime update happened 4 jiffies (at HZ=100) after the mtime;
the low digits are presumably jitter or NTP adjustments.
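
The arithmetic behind that observation can be checked with a short
sketch. It assumes HZ=100 as stated above (one jiffy = 10 ms); the
helper names are hypothetical and not taken from the kernel.

```c
#include <stdint.h>

/* Assumption: HZ=100, as in the message above, so one jiffy is 10 ms. */
#define SKETCH_HZ             100
#define SKETCH_NSEC_PER_SEC   1000000000LL
#define SKETCH_NSEC_PER_JIFFY (SKETCH_NSEC_PER_SEC / SKETCH_HZ)

/* Number of whole jiffies contained in a nanosecond delta. */
static int64_t delta_jiffies(int64_t delta_ns)
{
	return delta_ns / SKETCH_NSEC_PER_JIFFY;
}

/* Nanoseconds left over once the whole jiffies are removed. */
static int64_t delta_residual_ns(int64_t delta_ns)
{
	return delta_ns % SKETCH_NSEC_PER_JIFFY;
}
```

Feeding in the difference between the Access and Modify nanosecond
fields from the stat output (679914182 - 639914089) splits the gap
into whole jiffies plus a sub-jiffy residue.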

This is the result of current_time() using the plain tk_xtime
rather than reading the highres clocksource as ktime_get_real_ts64()
does.

This was a performance optimization a long time ago. We could
make the current_time() behavior configurable if we want though,
e.g. at compile time, or as a per-mount option. It's probably more
common these days to have a highres clocksource that can
be read efficiently than it was back when current_fs_time()
was first introduced.

   Arnd


Re: [RFC v2 03/83] Add super.h.

2018-03-16 Thread Darrick J. Wong
On Thu, Mar 15, 2018 at 11:17:54PM -0700, Andiry Xu wrote:
> On Thu, Mar 15, 2018 at 7:59 PM, Theodore Y. Ts'o  wrote:
> > On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
> >>
> >> You could also have a resolution of less than a nanosecond. Note
> >> that today, the file time stamps generated by the kernel are in
> >> jiffies resolution, so at best one millisecond. However, most modern
> >> file systems go with the 64+32 bit timestamps because it's not all
> >> that expensive.
> >
> > It actually depends on the architecture and the accuracy/granularity
> > of the timekeeping hardware available to the system, but it's possible
> > for the granularity of file time stamps to be up to one nanosecond.
> > So you can get results like this:
> >
> > % stat unix_io.o
> >   File: unix_io.o
> >   Size: 55000   Blocks: 112   IO Block: 4096   regular file
> > Device: fc01h/64513d   Inode: 19931278   Links: 1
> > Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
> > Access: 2018-03-15 18:09:21.679914182 -0400
> > Modify: 2018-03-15 18:09:21.639914089 -0400
> > Change: 2018-03-15 18:09:21.639914089 -0400
> >
> 
> Thanks for all the suggestions. I think I will follow ext4's time
> format. 2446 should be far away enough.

If you do, try to avoid the encoding problems that ext4 (still) has:

Not-fixed-by: a4dad1ae24f8 ("ext4: Fix handling of extended tv_sec")

--D

> Thanks,
> Andiry


Re: [RFC v2 03/83] Add super.h.

2018-03-16 Thread Andiry Xu
On Thu, Mar 15, 2018 at 7:59 PM, Theodore Y. Ts'o  wrote:
> On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
>>
>> You could also have a resolution of less than a nanosecond. Note
>> that today, the file time stamps generated by the kernel are in
>> jiffies resolution, so at best one millisecond. However, most modern
>> file systems go with the 64+32 bit timestamps because it's not all
>> that expensive.
>
> It actually depends on the architecture and the accuracy/granularity
> of the timekeeping hardware available to the system, but it's possible
> for the granularity of file time stamps to be up to one nanosecond.
> So you can get results like this:
>
> % stat unix_io.o
>   File: unix_io.o
>   Size: 55000   Blocks: 112   IO Block: 4096   regular file
> Device: fc01h/64513d   Inode: 19931278   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
> Access: 2018-03-15 18:09:21.679914182 -0400
> Modify: 2018-03-15 18:09:21.639914089 -0400
> Change: 2018-03-15 18:09:21.639914089 -0400
>

Thanks for all the suggestions. I think I will follow ext4's time
format. 2446 should be far away enough.

Thanks,
Andiry


Re: [RFC v2 03/83] Add super.h.

2018-03-15 Thread Theodore Y. Ts'o
On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
> 
> You could also have a resolution of less than a nanosecond. Note
> that today, the file time stamps generated by the kernel are in
> jiffies resolution, so at best one millisecond. However, most modern
> file systems go with the 64+32 bit timestamps because it's not all
> that expensive.

It actually depends on the architecture and the accuracy/granularity
of the timekeeping hardware available to the system, but it's possible
for the granularity of file time stamps to be up to one nanosecond.
So you can get results like this:

% stat unix_io.o 
  File: unix_io.o
  Size: 55000   Blocks: 112   IO Block: 4096   regular file
Device: fc01h/64513d   Inode: 19931278   Links: 1
Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
Access: 2018-03-15 18:09:21.679914182 -0400
Modify: 2018-03-15 18:09:21.639914089 -0400
Change: 2018-03-15 18:09:21.639914089 -0400

Cheers,

- Ted


Re: [RFC v2 03/83] Add super.h.

2018-03-15 Thread Arnd Bergmann
On Thu, Mar 15, 2018 at 6:51 PM, Andiry Xu  wrote:
> On Thu, Mar 15, 2018 at 2:05 AM, Arnd Bergmann  wrote:
>> On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu  wrote:
>
> Superblock mtime is not a big problem as it is updated rarely. 64-bit
> seconds and 32-bit nanoseconds make the inode and log entry bigger,
> and updating file->atime cannot be done with a single 64bit update.
> That may be annoying and needs to use journaling.

If this is a big concern, you could use a format similar to what ext4 has:
30 bits of nanoseconds, and 34 bits of seconds, where the upper two
bits count the epoch. That gives you a time range from years 1902 to
2446.
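
A minimal sketch of such a packed 64-bit timestamp follows. The layout
here (biased 34-bit seconds in the upper bits, 30-bit nanoseconds in
the lower bits) is an assumption made for illustration and is not
ext4's actual on-disk encoding; all names are hypothetical. It does
cover the same range, from -2^31 seconds (late 1901) to
2^34 - 2^31 - 1 seconds (2446).

```c
#include <stdint.h>
#include <stdbool.h>

#define SKETCH_NSEC_BITS 30
#define SKETCH_NSEC_MASK ((UINT64_C(1) << SKETCH_NSEC_BITS) - 1)
/* Bias so that the earliest representable second maps to stored value 0. */
#define SKETCH_SEC_BIAS  (INT64_C(1) << 31)
#define SKETCH_SEC_MIN   (-SKETCH_SEC_BIAS)
#define SKETCH_SEC_MAX   ((INT64_C(1) << 34) - SKETCH_SEC_BIAS - 1)

/* Pack (sec, nsec) into one 64-bit word; returns false if out of range. */
static bool sketch_pack_time(int64_t sec, uint32_t nsec, uint64_t *out)
{
	if (sec < SKETCH_SEC_MIN || sec > SKETCH_SEC_MAX || nsec >= 1000000000u)
		return false;
	*out = ((uint64_t)(sec + SKETCH_SEC_BIAS) << SKETCH_NSEC_BITS) | nsec;
	return true;
}

/* Reverse of sketch_pack_time(). */
static void sketch_unpack_time(uint64_t packed, int64_t *sec, uint32_t *nsec)
{
	*sec = (int64_t)(packed >> SKETCH_NSEC_BITS) - SKETCH_SEC_BIAS;
	*nsec = (uint32_t)(packed & SKETCH_NSEC_MASK);
}
```

Because the whole timestamp fits in one aligned 64-bit word, it can in
principle be updated with a single 8-byte store, which is the property
the atime concern above is about.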

You could also have a resolution of less than a nanosecond. Note
that today, the file time stamps generated by the kernel are in
jiffies resolution, so at best one millisecond. However, most modern
file systems go with the 64+32 bit timestamps because it's not all
that expensive.

  Arnd


Re: [RFC v2 03/83] Add super.h.

2018-03-15 Thread Andreas Dilger
On Mar 15, 2018, at 11:51 AM, Andiry Xu  wrote:
> 
> On Thu, Mar 15, 2018 at 2:05 AM, Arnd Bergmann  wrote:
>> On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu  wrote:
>>> On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
>>>  wrote:
>>>> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
>>
>>>>> + /* s_mtime and s_wtime should be together and their order should not be
>>>>> +  * changed. we use an 8 byte write to update both of them atomically
>>>>> +  */
>>>>> + __le32  s_mtime;        /* mount time */
>>>>> + __le32  s_wtime;        /* write time */
>>>>
>>>> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>>>>
>>> 
>>> I will try fixing this in the next version.
>> 
>> I would also recommend adding nanosecond-resolution timestamps.
>> In theory, a signed 64-bit nanosecond field is sufficient for each timestamp
>> (it's good for several hundred years), but the more common format uses
>> 64-bit seconds and 32-bit nanoseconds in other file systems.
>> 
>> Unfortunately, it looks like you will have to come up with a more
>> sophisticated update method than the one above: even if you leave out the
>> nanoseconds, you can't easily rely on a 16-byte atomic update across
>> architectures to deal with the two 64-bit timestamps. For the superblock
>> fields, you might be able to get away with using second resolution, and
>> then encoding the timestamps as a signed 64-bit 'mkfs time' along with two
>> unsigned 32-bit times added on top, which gives you a range of 136 years
>> in which to mount a file system after its creation.
>> 
> 
> I will take a look at other file systems.
> 
> Superblock mtime is not a big problem as it is updated rarely. 64-bit
> seconds and 32-bit nanoseconds make the inode and log entry bigger,
> and updating file->atime cannot be done with a single 64bit update.
> That may be annoying and needs to use journaling.

If the 64-bit atomicity was really a performance issue, you could do
something like:

__u32   time_high = seconds >> 32;
__u64   time_low = seconds << 32 | nanoseconds;

and then you only need to update time_high with a journal operation if it
has changed from the current time_high value (about once every 140 years),
and the time_low can be set atomically.  It needs a few extra cycles each
time (hidden with an unlikely()) vs. just setting both, but that is a win
if it avoids other CPU or IO overhead.
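
Expanded into a self-contained sketch, the idea might look like the
following. The struct and function names are hypothetical, and the
plain stores stand in for what would be a journaled update of
time_high and an atomic 8-byte store of time_low in a real file
system.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical in-memory stand-in for the two on-media fields. */
struct sketch_time {
	uint32_t time_high;	/* bits 32..63 of the seconds value */
	uint64_t time_low;	/* low 32 bits of seconds, then nanoseconds */
};

/*
 * Store a timestamp. Returns true when time_high had to change, which
 * is the rare case that would need a journaled update.
 */
static bool sketch_set_time(struct sketch_time *t, uint64_t seconds,
			    uint32_t nanoseconds)
{
	uint32_t high = (uint32_t)(seconds >> 32);
	uint64_t low = (seconds << 32) | nanoseconds;
	bool high_changed = (high != t->time_high);

	if (high_changed)
		t->time_high = high;	/* slow path, roughly once per 2^32 s */
	t->time_low = low;		/* common path: single 64-bit store */
	return high_changed;
}

/* Reassemble the full 64-bit seconds value. */
static uint64_t sketch_get_seconds(const struct sketch_time *t)
{
	return ((uint64_t)t->time_high << 32) | (t->time_low >> 32);
}

/* The nanoseconds live in the low 32 bits of time_low. */
static uint32_t sketch_get_nanoseconds(const struct sketch_time *t)
{
	return (uint32_t)t->time_low;
}
```

In kernel code the slow-path test would typically be wrapped in
unlikely(), as mentioned above.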

Cheers, Andreas


Re: [RFC v2 03/83] Add super.h.

2018-03-15 Thread Andiry Xu
On Thu, Mar 15, 2018 at 2:05 AM, Arnd Bergmann  wrote:
> On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu  wrote:
>> On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
>>  wrote:
>>> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
>
>>>> + /* s_mtime and s_wtime should be together and their order should not be
>>>> +  * changed. we use an 8 byte write to update both of them atomically
>>>> +  */
>>>> + __le32  s_mtime;        /* mount time */
>>>> + __le32  s_wtime;        /* write time */
>>>
>>> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>>>
>>
>> I will try fixing this in the next version.
>
> I would also recommend adding nanosecond-resolution timestamps.
> In theory, a signed 64-bit nanosecond field is sufficient for each timestamp
> (it's good for several hundred years), but the more common format uses
> 64-bit seconds and 32-bit nanoseconds in other file systems.
>
> Unfortunately, it looks like you will have to come up with a more
> sophisticated update method than the one above: even if you leave out the
> nanoseconds, you can't easily rely on a 16-byte atomic update across
> architectures to deal with the two 64-bit timestamps. For the superblock
> fields, you might be able to get away with using second resolution, and
> then encoding the timestamps as a signed 64-bit 'mkfs time' along with two
> unsigned 32-bit times added on top, which gives you a range of 136 years
> in which to mount a file system after its creation.
>

I will take a look at other file systems.

Superblock mtime is not a big problem as it is updated rarely. 64-bit
seconds and 32-bit nanoseconds make the inode and log entry bigger,
and updating file->atime cannot be done with a single 64bit update.
That may be annoying and needs to use journaling.

Thanks,
Andiry

>   Arnd


Re: [RFC v2 03/83] Add super.h.

2018-03-15 Thread Arnd Bergmann
On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu  wrote:
> On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
>  wrote:
>> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:

>>> + /* s_mtime and s_wtime should be together and their order should not be
>>> +  * changed. we use an 8 byte write to update both of them atomically
>>> +  */
>>> + __le32  s_mtime;        /* mount time */
>>> + __le32  s_wtime;        /* write time */
>>
>> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>>
>
> I will try fixing this in the next version.

I would also recommend adding nanosecond-resolution timestamps.
In theory, a signed 64-bit nanosecond field is sufficient for each timestamp
(it's good for several hundred years), but the more common format uses
64-bit seconds and 32-bit nanoseconds in other file systems.
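
As a rough sketch of the single-field option: a signed 64-bit
nanosecond count spans about 2^63 ns, or roughly 292 years on either
side of the epoch. The helper below (a hypothetical name, not a kernel
API) converts a seconds/nanoseconds pair and rejects values that would
overflow; its bounds check is slightly conservative at the positive
end.

```c
#include <stdint.h>
#include <stdbool.h>

#define SKETCH_NSEC_PER_SEC INT64_C(1000000000)

/*
 * Convert (sec, nsec) into one signed 64-bit nanosecond count.
 * Returns false if nsec is not a valid sub-second value or if the
 * result might not fit in an int64_t.
 */
static bool sketch_to_ns(int64_t sec, uint32_t nsec, int64_t *out)
{
	if (nsec >= SKETCH_NSEC_PER_SEC)
		return false;
	if (sec > INT64_MAX / SKETCH_NSEC_PER_SEC - 1 ||
	    sec < INT64_MIN / SKETCH_NSEC_PER_SEC)
		return false;
	*out = sec * SKETCH_NSEC_PER_SEC + (int64_t)nsec;
	return true;
}
```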

Unfortunately, it looks like you will have to come up with a more
sophisticated update method than the one above: even if you leave out the
nanoseconds, you can't easily rely on a 16-byte atomic update across
architectures to deal with the two 64-bit timestamps. For the superblock
fields, you might be able to get away with using second resolution, and
then encoding the timestamps as a signed 64-bit 'mkfs time' along with two
unsigned 32-bit times added on top, which gives you a range of 136 years
in which to mount a file system after its creation.
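
One way to picture the 'mkfs time' plus offsets scheme is the sketch
below. All names are hypothetical and the packing order of the two
offsets is an assumption; the point is only that two unsigned 32-bit
offsets still fit in a single 8-byte word, and that 2^32 seconds is
about 136 years.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Express 'now' as an unsigned 32-bit offset in seconds from the
 * (write-once) mkfs time. Fails if 'now' precedes mkfs time or lies
 * more than 2^32 - 1 seconds after it.
 */
static bool sketch_offset_from_mkfs(int64_t mkfs_time, int64_t now,
				    uint32_t *offset)
{
	uint64_t delta;

	if (now < mkfs_time)
		return false;
	/* Unsigned subtraction avoids signed overflow for extreme inputs. */
	delta = (uint64_t)now - (uint64_t)mkfs_time;
	if (delta > UINT32_MAX)
		return false;
	*offset = (uint32_t)delta;
	return true;
}

/* Recover the absolute time in seconds from the base and an offset. */
static int64_t sketch_absolute_time(int64_t mkfs_time, uint32_t offset)
{
	return mkfs_time + (int64_t)offset;
}

/* Combine both offsets into one word suitable for a single 8-byte store. */
static uint64_t sketch_pack_offsets(uint32_t mount_offset,
				    uint32_t write_offset)
{
	return ((uint64_t)write_offset << 32) | mount_offset;
}
```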

  Arnd


Re: [RFC v2 03/83] Add super.h.

2018-03-15 Thread Andiry Xu
On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
 wrote:
> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
>> From: Andiry Xu 
>>
>> This header file defines NOVA persistent and volatile superblock
>> data structures.
>>
>> It also defines NOVA block layout:
>>
>> Page 0: Superblock
>> Page 1: Reserved inodes
>> Page 2 - 15: Reserved
>> Page 16 - 31: Inode table pointers
>> Page 32 - 47: Journal address pointers
>> Page 48 - 63: Reserved
>> Pages n-2: Replicate reserved inodes
>> Pages n-1: Replicate superblock
>>
>> Other pages are for normal inodes, logs and data.
>>
>> Signed-off-by: Andiry Xu 
>> ---
>>  fs/nova/super.h | 149 
>> 
>>  1 file changed, 149 insertions(+)
>>  create mode 100644 fs/nova/super.h
>>
>> diff --git a/fs/nova/super.h b/fs/nova/super.h
>> new file mode 100644
>> index 000..cb53908
>> --- /dev/null
>> +++ b/fs/nova/super.h
>> @@ -0,0 +1,149 @@
>> +#ifndef __SUPER_H
>> +#define __SUPER_H
>> +/*
>> + * Structure of the NOVA super block in PMEM
>> + *
>> + * The fields are partitioned into static and dynamic fields. The static fields
>> + * never change after file system creation. This was primarily done because
>> + * nova_get_block() returns NULL if the block offset is 0 (helps in catching
>> + * bugs). So if we modify any field using journaling (for consistency), we
>> + * will have to modify s_sum which is at offset 0. So journaling code fails.
>> + * This (static+dynamic fields) is a temporary solution and can be avoided
>> + * once the file system becomes stable and nova_get_block() returns correct
>> + * pointers even for offset 0.
>> + */
>> +struct nova_super_block {
>> + /* static fields. they never change after file system creation.
>> +  * checksum only validates up to s_start_dynamic field below
>> +  */
>> + __le32  s_sum;                  /* checksum of this sb */
>> + __le32  s_magic;                /* magic signature */
>> + __le32  s_padding32;
>> + __le32  s_blocksize;            /* blocksize in bytes */
>> + __le64  s_size;                 /* total size of fs in bytes */
>> + char    s_volume_name[16];      /* volume name */
>> +
>> + /* all the dynamic fields should go here */
>> + __le64  s_epoch_id; /* Epoch ID */
>> +
>> + /* s_mtime and s_wtime should be together and their order should not be
>> +  * changed. we use an 8 byte write to update both of them atomically
>> +  */
>> + __le32  s_mtime;                /* mount time */
>> + __le32  s_wtime;                /* write time */
>
> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>

I will try fixing this in the next version.

>> +} __attribute((__packed__));
>> +
>> +#define NOVA_SB_SIZE 512   /* must be power of two */
>> +
>> +/* === Reserved blocks = */
>> +
>> +/*
>> + * Page 0 contains super blocks;
>> + * Page 1 contains reserved inodes;
>> + * Page 2 - 15 are reserved.
>> + * Page 16 - 31 contain pointers to inode tables.
>> + * Page 32 - 47 contain pointers to journal pages.
>> + */
>> +#define  HEAD_RESERVED_BLOCKS    64
>> +#define  NUM_JOURNAL_PAGES       16
>> +
>> +#define  SUPER_BLOCK_START       0 // Superblock
>> +#define  RESERVE_INODE_START     1 // Reserved inodes
>> +#define  INODE_TABLE_START       16 // inode table pointers
>> +#define  JOURNAL_START           32 // journal pointer table
>> +
>> +/* For replica super block and replica reserved inodes */
>> +#define  TAIL_RESERVED_BLOCKS    2
>> +
>> +/* === Reserved inodes = */
>> +
>> +/* We have space for 31 reserved inodes */
>> +#define NOVA_ROOT_INO        (1)
>> +#define NOVA_INODETABLE_INO  (2) /* Fake inode associated with inode
>> +                                  * storage.  We need this because our
>> +                                  * allocator requires inode to be
>> +                                  * associated with each allocation.
>> +                                  * The data actually lives in linked
>> +                                  * lists in INODE_TABLE_START. */
>> +#define NOVA_BLOCKNODE_INO   (3) /* Storage for allocator state */
>> +#define NOVA_LITEJOURNAL_INO (4) /* Storage for lightweight journals */
>> +#define NOVA_INODELIST_INO   (5) /* Storage for Inode free list */
>> +
>> +
>> +/* Normal inode starts at 32 */
>> +#define NOVA_NORMAL_INODE_START  (32)
>
> I've been wondering this whole time, why not make the inode number the
> byte offset into the pmem?  Then you don't have to lose the last 8 bytes
> of each inode block to point to the next one.
>

During failure recovery, NOVA scans the inode logs. To find all the
inodes, it follows the inode block list. Making inode number the byte
offset cannot 

Re: [RFC v2 03/83] Add super.h.

2018-03-14 Thread Darrick J. Wong
On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
> From: Andiry Xu 
> 
> This header file defines NOVA persistent and volatile superblock
> data structures.
> 
> It also defines NOVA block layout:
> 
> Page 0: Superblock
> Page 1: Reserved inodes
> Page 2 - 15: Reserved
> Page 16 - 31: Inode table pointers
> Page 32 - 47: Journal address pointers
> Page 48 - 63: Reserved
> Pages n-2: Replicate reserved inodes
> Pages n-1: Replicate superblock
> 
> Other pages are for normal inodes, logs and data.
> 
> Signed-off-by: Andiry Xu 
> ---
>  fs/nova/super.h | 149 
> 
>  1 file changed, 149 insertions(+)
>  create mode 100644 fs/nova/super.h
> 
> diff --git a/fs/nova/super.h b/fs/nova/super.h
> new file mode 100644
> index 000..cb53908
> --- /dev/null
> +++ b/fs/nova/super.h
> @@ -0,0 +1,149 @@
> +#ifndef __SUPER_H
> +#define __SUPER_H
> +/*
> + * Structure of the NOVA super block in PMEM
> + *
> + * The fields are partitioned into static and dynamic fields. The static fields
> + * never change after file system creation. This was primarily done because
> + * nova_get_block() returns NULL if the block offset is 0 (helps in catching
> + * bugs). So if we modify any field using journaling (for consistency), we
> + * will have to modify s_sum which is at offset 0. So journaling code fails.
> + * This (static+dynamic fields) is a temporary solution and can be avoided
> + * once the file system becomes stable and nova_get_block() returns correct
> + * pointers even for offset 0.
> + */
> +struct nova_super_block {
> + /* static fields. they never change after file system creation.
> +  * checksum only validates up to s_start_dynamic field below
> +  */
> + __le32  s_sum;              /* checksum of this sb */
> + __le32  s_magic;            /* magic signature */
> + __le32  s_padding32;
> + __le32  s_blocksize;        /* blocksize in bytes */
> + __le64  s_size;             /* total size of fs in bytes */
> + char    s_volume_name[16];  /* volume name */
> +
> + /* all the dynamic fields should go here */
> + __le64  s_epoch_id; /* Epoch ID */
> +
> + /* s_mtime and s_wtime should be together and their order should not be
> +  * changed. we use an 8 byte write to update both of them atomically
> +  */
> + __le32  s_mtime;            /* mount time */
> + __le32  s_wtime;            /* write time */

Hmmm, 32-bit timestamps?  2038 isn't that far away...
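
To put numbers on that, here is a small user-space sketch of where a
32-bit count of seconds since 1970 runs out. The helper names are made
up for illustration. Whether the fields would be read as signed or
unsigned is an assumption either way, since the patch only declares
them __le32.

```c
#include <assert.h>	/* used by callers that check the limits */
#include <stdint.h>

/* Last second a signed 32-bit counter can hold: 2038-01-19 03:14:07 UTC. */
#define LAST_S32_SECOND	((int64_t)INT32_MAX)
/* Last second an unsigned 32-bit counter can hold: 2106-02-07 06:28:15 UTC. */
#define LAST_U32_SECOND	((int64_t)UINT32_MAX)

/* Does a 64-bit time fit a signed 32-bit field without loss? */
static int fits_s32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= LAST_S32_SECOND;
}

/* Does a 64-bit time fit an unsigned 32-bit field without loss? */
static int fits_u32(int64_t secs)
{
	return secs >= 0 && secs <= LAST_U32_SECOND;
}

/* Narrowing keeps only the low 32 bits, so later times wrap around. */
static uint32_t narrow_to_u32(int64_t secs)
{
	return (uint32_t)secs;
}
```

Reading the field as unsigned only moves the limit to 2106; a 64-bit
seconds field avoids the problem altogether.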

> +} __attribute((__packed__));
> +
> +#define NOVA_SB_SIZE 512   /* must be power of two */
> +
> +/* === Reserved blocks = */
> +
> +/*
> + * Page 0 contains super blocks;
> + * Page 1 contains reserved inodes;
> + * Page 2 - 15 are reserved.
> + * Page 16 - 31 contain pointers to inode tables.
> + * Page 32 - 47 contain pointers to journal pages.
> + */
> +#define HEAD_RESERVED_BLOCKS    64
> +#define NUM_JOURNAL_PAGES       16
> +
> +#define SUPER_BLOCK_START       0  // Superblock
> +#define RESERVE_INODE_START     1  // Reserved inodes
> +#define INODE_TABLE_START       16 // inode table pointers
> +#define JOURNAL_START           32 // journal pointer table
> +
> +/* For replica super block and replica reserved inodes */
> +#define TAIL_RESERVED_BLOCKS    2
> +
> +/* === Reserved inodes = */
> +
> +/* We have space for 31 reserved inodes */
> +#define NOVA_ROOT_INO        (1)
> +#define NOVA_INODETABLE_INO  (2) /* Fake inode associated with inode
> +                                  * storage.  We need this because our
> +                                  * allocator requires inode to be
> +                                  * associated with each allocation.
> +                                  * The data actually lives in linked
> +                                  * lists in INODE_TABLE_START. */
> +#define NOVA_BLOCKNODE_INO   (3) /* Storage for allocator state */
> +#define NOVA_LITEJOURNAL_INO (4) /* Storage for lightweight journals */
> +#define NOVA_INODELIST_INO   (5) /* Storage for Inode free list */
> +
> +
> +/* Normal inode starts at 32 */
> +#define NOVA_NORMAL_INODE_START  (32)

I've been wondering this whole time, why not make the inode number the
byte offset into the pmem?  Then you don't have to lose the last 8 bytes
of each inode block to point to the next one.
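
Roughly what I have in mind, as a user-space sketch rather than a
patch: the inode number is the byte offset itself, so a lookup is just
base + ino after a sanity check. SKETCH_INODE_SIZE and both helper
names are assumptions made up for this example.

```c
#include <assert.h>	/* used by callers that check the mapping */
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INODE_SIZE	128u	/* assumed on-media inode size */

/*
 * Map an inode number to its address, treating the number as a byte
 * offset from the start of the pmem region.  Returns NULL when the
 * offset is misaligned or the inode would not fit inside the region.
 */
static void *inode_from_ino(void *pmem_base, uint64_t pmem_size, uint64_t ino)
{
	if (pmem_size < SKETCH_INODE_SIZE)
		return NULL;
	if (ino % SKETCH_INODE_SIZE != 0)
		return NULL;
	if (ino > pmem_size - SKETCH_INODE_SIZE)
		return NULL;
	return (char *)pmem_base + ino;
}

/* Inverse mapping: the inode number is the distance from the base. */
static uint64_t ino_from_inode(const void *pmem_base, const void *inode)
{
	return (uint64_t)((const char *)inode - (const char *)pmem_base);
}
```

With a mapping like this there is no per-block "next" pointer to
follow, so whatever currently walks the inode block list would need
another way to enumerate inodes.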

--D

> +
> +
> +
> +/*
> + * NOVA super-block data in DRAM
> + */
> +struct nova_sb_info {
> + struct super_block *sb; /* VFS super block */
> + struct nova_super_block *nova_sb;   /* DRAM copy of SB */
> + struct block_device *s_bdev;
> + struct dax_device *s_dax_dev;
> +
> + /*
> +  * 


[RFC v2 03/83] Add super.h.

2018-03-10 Thread Andiry Xu
From: Andiry Xu 

This header file defines NOVA persistent and volatile superblock
data structures.

It also defines NOVA block layout:

Page 0: Superblock
Page 1: Reserved inodes
Page 2 - 15: Reserved
Page 16 - 31: Inode table pointers
Page 32 - 47: Journal address pointers
Page 48 - 63: Reserved
Pages n-2: Replicate reserved inodes
Pages n-1: Replicate superblock

Other pages are for normal inodes, logs and data.
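
As an illustration of the layout above, a minimal user-space sketch
that turns page numbers into byte offsets and locates the two replica
pages for a device of n pages. The 4096-byte page size and the helper
names are assumptions made for this example; the constants mirror the
ones defined in the header.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096ULL	/* assumed page size */

/* Values taken from the layout described above. */
#define HEAD_RESERVED_BLOCKS	64
#define TAIL_RESERVED_BLOCKS	2
#define SUPER_BLOCK_START	0
#define RESERVE_INODE_START	1
#define INODE_TABLE_START	16
#define JOURNAL_START		32

/* Byte offset of a page from the start of the device. */
static uint64_t page_to_offset(uint64_t page)
{
	return page * SKETCH_PAGE_SIZE;
}

/* Page n-2 holds the replica of the reserved inodes. */
static uint64_t replica_inode_page(uint64_t n)
{
	assert(n >= HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS);
	return n - 2;
}

/* Page n-1 holds the replica of the superblock. */
static uint64_t replica_sb_page(uint64_t n)
{
	assert(n >= HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS);
	return n - 1;
}

/* Pages left over for normal inodes, logs and data. */
static uint64_t general_pages(uint64_t n)
{
	assert(n >= HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS);
	return n - HEAD_RESERVED_BLOCKS - TAIL_RESERVED_BLOCKS;
}
```

For a 1024-page device this gives 958 general-purpose pages, with the
replicas on pages 1022 and 1023.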

Signed-off-by: Andiry Xu 
---
 fs/nova/super.h | 149 
 1 file changed, 149 insertions(+)
 create mode 100644 fs/nova/super.h

diff --git a/fs/nova/super.h b/fs/nova/super.h
new file mode 100644
index 000..cb53908
--- /dev/null
+++ b/fs/nova/super.h
@@ -0,0 +1,149 @@
+#ifndef __SUPER_H
+#define __SUPER_H
+/*
+ * Structure of the NOVA super block in PMEM
+ *
+ * The fields are partitioned into static and dynamic fields. The static fields
+ * never change after file system creation. This was primarily done because
+ * nova_get_block() returns NULL if the block offset is 0 (helps in catching
+ * bugs). So if we modify any field using journaling (for consistency), we
+ * will have to modify s_sum which is at offset 0. So journaling code fails.
+ * This (static+dynamic fields) is a temporary solution and can be avoided
+ * once the file system becomes stable and nova_get_block() returns correct
+ * pointers even for offset 0.
+ */
+struct nova_super_block {
+   /* static fields. they never change after file system creation.
+    * checksum only validates up to s_start_dynamic field below
+    */
+   __le32  s_sum;              /* checksum of this sb */
+   __le32  s_magic;            /* magic signature */
+   __le32  s_padding32;
+   __le32  s_blocksize;        /* blocksize in bytes */
+   __le64  s_size;             /* total size of fs in bytes */
+   char    s_volume_name[16];  /* volume name */
+
+   /* all the dynamic fields should go here */
+   __le64  s_epoch_id; /* Epoch ID */
+
+   /* s_mtime and s_wtime should be together and their order should not be
+    * changed. we use an 8 byte write to update both of them atomically
+    */
+   __le32  s_mtime;            /* mount time */
+   __le32  s_wtime;            /* write time */
+} __attribute((__packed__));
+
+#define NOVA_SB_SIZE 512   /* must be power of two */
+
+/* === Reserved blocks = */
+
+/*
+ * Page 0 contains super blocks;
+ * Page 1 contains reserved inodes;
+ * Page 2 - 15 are reserved.
+ * Page 16 - 31 contain pointers to inode tables.
+ * Page 32 - 47 contain pointers to journal pages.
+ */
+#define HEAD_RESERVED_BLOCKS    64
+#define NUM_JOURNAL_PAGES       16
+
+#define SUPER_BLOCK_START       0  // Superblock
+#define RESERVE_INODE_START     1  // Reserved inodes
+#define INODE_TABLE_START       16 // inode table pointers
+#define JOURNAL_START           32 // journal pointer table
+
+/* For replica super block and replica reserved inodes */
+#define TAIL_RESERVED_BLOCKS    2
+
+/* === Reserved inodes = */
+
+/* We have space for 31 reserved inodes */
+#define NOVA_ROOT_INO           (1)
+#define NOVA_INODETABLE_INO     (2) /* Fake inode associated with inode
+                                     * storage.  We need this because our
+                                     * allocator requires inode to be
+                                     * associated with each allocation.
+                                     * The data actually lives in linked
+                                     * lists in INODE_TABLE_START. */
+#define NOVA_BLOCKNODE_INO      (3) /* Storage for allocator state */
+#define NOVA_LITEJOURNAL_INO    (4) /* Storage for lightweight journals */
+#define NOVA_INODELIST_INO      (5) /* Storage for Inode free list */
+
+
+/* Normal inode starts at 32 */
+#define NOVA_NORMAL_INODE_START  (32)
+
+
+
+/*
+ * NOVA super-block data in DRAM
+ */
+struct nova_sb_info {
+   struct super_block *sb; /* VFS super block */
+   struct nova_super_block *nova_sb;   /* DRAM copy of SB */
+   struct block_device *s_bdev;
+   struct dax_device *s_dax_dev;
+
+   /*
+    * base physical and virtual address of NOVA (which is also
+    * the pointer to the super block)
+    */
+   phys_addr_t phys_addr;
+   void        *virt_addr;
+   void        *replica_reserved_inodes_addr;
+   void        *replica_sb_addr;
+
+   unsigned long   num_blocks;
+
+   /* Mount options */
+   unsigned long   bpi;
+   unsigned long   blocksize;
+   unsigned long   initsize;
+   unsigned long   s_mount_opt;
+

[RFC v2 03/83] Add super.h.

2018-03-10 Thread Andiry Xu
From: Andiry Xu 

This header file defines NOVA persistent and volatile superblock
data structures.

It also defines NOVA block layout:

Page 0: Superblock
Page 1: Reserved inodes
Page 2 - 15: Reserved
Page 16 - 31: Inode table pointers
Page 32 - 47: Journal address pointers
Page 48 - 63: Reserved
Pages n-2: Replicate reserved inodes
Pages n-1: Replicate superblock

Other pages are for normal inodes, logs and data.

Signed-off-by: Andiry Xu 
---
 fs/nova/super.h | 149 
 1 file changed, 149 insertions(+)
 create mode 100644 fs/nova/super.h

diff --git a/fs/nova/super.h b/fs/nova/super.h
new file mode 100644
index 000..cb53908
--- /dev/null
+++ b/fs/nova/super.h
@@ -0,0 +1,149 @@
+#ifndef __SUPER_H
+#define __SUPER_H
+/*
+ * Structure of the NOVA super block in PMEM
+ *
+ * The fields are partitioned into static and dynamic fields. The static fields
+ * never change after file system creation. This was primarily done because
+ * nova_get_block() returns NULL if the block offset is 0 (helps in catching
+ * bugs). So if we modify any field using journaling (for consistency), we
+ * will have to modify s_sum which is at offset 0. So journaling code fails.
+ * This (static+dynamic fields) is a temporary solution and can be avoided
+ * once the file system becomes stable and nova_get_block() returns correct
+ * pointers even for offset 0.
+ */
+struct nova_super_block {
+   /* static fields. they never change after file system creation.
+* checksum only validates up to s_start_dynamic field below
+*/
+   __le32  s_sum;  /* checksum of this sb */
+   __le32  s_magic;/* magic signature */
+   __le32  s_padding32;
+   __le32  s_blocksize;/* blocksize in bytes */
+   __le64  s_size; /* total size of fs in bytes */
+   chars_volume_name[16];  /* volume name */
+
+   /* all the dynamic fields should go here */
+   __le64  s_epoch_id; /* Epoch ID */
+
+   /* s_mtime and s_wtime should be together and their order should not be
+* changed. we use an 8 byte write to update both of them atomically
+*/
+   __le32  s_mtime;/* mount time */
+   __le32  s_wtime;/* write time */
+} __attribute((__packed__));
+
+#define NOVA_SB_SIZE 512   /* must be power of two */
+
+/* === Reserved blocks = */
+
+/*
+ * Page 0 contains super blocks;
+ * Page 1 contains reserved inodes;
+ * Page 2 - 15 are reserved.
+ * Page 16 - 31 contain pointers to inode tables.
+ * Page 32 - 47 contain pointers to journal pages.
+ */
+#defineHEAD_RESERVED_BLOCKS64
+#defineNUM_JOURNAL_PAGES   16
+
+#defineSUPER_BLOCK_START   0 // Superblock
+#defineRESERVE_INODE_START 1 // Reserved inodes
+#defineINODE_TABLE_START   16 // inode table pointers
+#defineJOURNAL_START   32 // journal pointer table
+
+/* For replica super block and replica reserved inodes */
+#defineTAIL_RESERVED_BLOCKS2
+
+/* === Reserved inodes = */
+
+/* We have space for 31 reserved inodes */
+#define NOVA_ROOT_INO  (1)
+#define NOVA_INODETABLE_INO(2) /* Fake inode associated with inode
+* stroage.  We need this because our
+* allocator requires inode to be
+* associated with each allocation.
+* The data actually lives in linked
+* lists in INODE_TABLE_START. */
+#define NOVA_BLOCKNODE_INO (3) /* Storage for allocator state */
+#define NOVA_LITEJOURNAL_INO   (4) /* Storage for lightweight journals */
+#define NOVA_INODELIST_INO (5) /* Storage for Inode free list */
+
+
+/* Normal inode starts at 32 */
+#define NOVA_NORMAL_INODE_START  (32)
+
+
+
+/*
+ * NOVA super-block data in DRAM
+ */
+struct nova_sb_info {
+   struct super_block *sb; /* VFS super block */
+   struct nova_super_block *nova_sb;   /* DRAM copy of SB */
+   struct block_device *s_bdev;
+   struct dax_device *s_dax_dev;
+
+   /*
+* base physical and virtual address of NOVA (which is also
+* the pointer to the super block)
+*/
+   phys_addr_t phys_addr;
+   void*virt_addr;
+   void*replica_reserved_inodes_addr;
+   void*replica_sb_addr;
+
+   unsigned long   num_blocks;
+
+   /* Mount options */
+   unsigned long   bpi;
+   unsigned long   blocksize;
+   unsigned long   initsize;
+   unsigned long   s_mount_opt;
+   kuid_t  uid;/* Mount uid