Re: Volume appears full but TB's of space available

2017-04-07 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 07 Apr 2017 07:41:22 -0400 as
excerpted:

> 2. Results from 'btrfs scrub'.  This is somewhat tricky because scrub is
> either asynchronous or blocks for a _long_ time.  The simplest option
> I've found is to fire off an asynchronous scrub to run during down-time,
> and then schedule recurring checks with 'btrfs scrub status'.  On the
> plus side, 'btrfs scrub status' already returns non-zero if the scrub
> found errors.
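
A minimal cron-able checker along those lines might look like this (a
sketch only; /mnt/data is a placeholder mount point, and it relies on
the non-zero exit status described above):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(void)
{
	/* "btrfs scrub status" exits non-zero if the last scrub found errors */
	int rc = system("btrfs scrub status /mnt/data >/dev/null 2>&1");

	if (rc == -1 || !WIFEXITED(rc) || WEXITSTATUS(rc) != 0) {
		fprintf(stderr, "scrub reported errors on /mnt/data\n");
		return 1;
	}
	return 0;
}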

This is (one place) where my "keep it small enough to be in-practice-
manageable" comes in.

I always run my scrubs with -B (don't background, always, because I've 
scripted it), and they normally come back within a minute. =:^)

But that's because I'm running multiple btrfs pair-device raid1 on a pair 
of partitioned SSDs, with each independent btrfs built on a partition 
from each ssd, with all partitions under 50 GiB.  So scrubs take less 
than a minute to run (on the under 1 GiB /var/log, it returns effectively 
immediately, as soon as I hit enter on the command), but that's not 
entirely surprising at the sizes of the ssd-based btrfs' I am running.

When scrubs (and balances, and checks) come back in a minute or so, it 
makes maintenance /so/ much less of a hassle. =:^)

And the generally single-purpose and relatively small size of each 
filesystem means I can, for instance, keep / (with all the system libs, 
bins, manpages, and the installed-package database, among other things) 
mounted read-only by default, and keep the updates partition (gentoo so 
that's the gentoo and overlay trees, the sources and binpkg cache, ccache 
cache, etc) and (large non-ssd/non-btrfs) media partitions unmounted by 
default.

Which in turn means when something /does/ go wrong, as long as it wasn't 
a physical device, there's much less data at risk, because most of it was 
probably either unmounted, or mounted read-only.

Which in turn means I don't have to worry about scrub/check or other 
repair on those filesystems at all, only the ones that were actually 
mounted writable.  And as mentioned, those scrub and check fast enough 
that I can literally wait at the terminal for command completion. =:^)

Of course my setup's what most would call partitioned to the extreme, but 
it does have its advantages, and it works well for me, which after all is 
the important thing for /my/ setup.

But the more generic point remains: if you set up multi-TB filesystems 
that take days or weeks for a maintenance command to complete, running 
those maintenance commands isn't going to be something done as often as 
one arguably should, and rebuilding from a filesystem or device failure 
is going to take far longer than one would like, as well.  We've seen the 
reports here.  If that's what you're doing, strongly consider breaking 
your filesystems down to something rather more manageable, say a couple 
TiB each.  Broken along natural usage lines, it can save a lot on the  
caffeine and headache pills when something does go wrong.

Unless of course like one poster here, you're handling double-digit-TB 
super-collider data files.  Those tend to be a bit difficult to store on 
sub-double-digit-TB filesystems.  =:^)  But that's the other extreme from 
what I've done here, and he actually has a good /reason/ for /his/
double-digit- or even triple-digit-TB filesystems.  There's not much to 
be done about his use-case, and indeed, AFAIK he decided btrfs simply 
isn't stable and mature enough for that use-case yet, tho I believe he's 
using it for some other, more minor and less gargantuan use-cases.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/12] trace: Make trace_hwlat timestamp y2038 safe

2017-04-07 Thread Deepa Dinamani
>> - trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%ld.%09ld",
>> + trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%lld.%09ld",
>>field->seqnum,
>>field->duration,
>>field->outer_duration,
>> -  field->timestamp.tv_sec,
>> +  (long long)field->timestamp.tv_sec,
>
> Refresh my memory. We need the cast because on 64 bit boxes
> timestamp.tv_sec is just a long?

This is only required until we change the definition of timespec64.
Right now it is defined as

#if __BITS_PER_LONG == 64
# define timespec64 timespec
#else
struct timespec64 {
time64_t tv_sec;
long tv_nsec;
};
#endif

And timespec.tv_sec is just long int on 64 bit machines.
This is why we need the cast now.

We will probably change this and only define __kernel_timespec instead
of timespec, leaving only one definition of timespec64.
At that time, we will not need this.
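
As a small user-space sketch of the mismatch (assuming a 64-bit glibc
system, where tv_sec is a plain long just like in the kernel definition
above):

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts = { .tv_sec = 1, .tv_nsec = 500000000 };

	/* Without the cast, "%lld" would receive a plain long here and
	 * gcc would warn with -Wformat:
	 * printf("%lld.%09ld\n", ts.tv_sec, ts.tv_nsec); */
	printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
	return 0;
}

The cast keeps the argument type matching the format string on both
32-bit builds (where tv_sec is time64_t, i.e. long long) and 64-bit
builds (where it is long).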

-Deepa
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/12] trace: Make trace_hwlat timestamp y2038 safe

2017-04-07 Thread Steven Rostedt
On Fri,  7 Apr 2017 17:57:00 -0700
Deepa Dinamani  wrote:

> struct timespec is not y2038 safe on 32 bit machines
> and needs to be replaced by struct timespec64
> in order to represent times beyond year 2038 on such
> machines.
> 
> Fix all the timestamp representation in struct trace_hwlat
> and all the corresponding implementations.
> 

> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 02a4aeb..08f9bab 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -4,7 +4,6 @@
>   * Copyright (C) 2008 Red Hat Inc, Steven Rostedt 
>   *
>   */
> -
>  #include 
>  #include 
>  #include 
> @@ -1161,11 +1160,11 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
>  
>   trace_assign_type(field, entry);
>  
> - trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%ld.%09ld",
> + trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%lld.%09ld",
>field->seqnum,
>field->duration,
>field->outer_duration,
> -  field->timestamp.tv_sec,
> +  (long long)field->timestamp.tv_sec,

Refresh my memory. We need the cast because on 64 bit boxes
timestamp.tv_sec is just a long?

Other than that.

Reviewed-by: Steven Rostedt (VMware) 

-- Steve

>field->timestamp.tv_nsec);
>  
>   if (field->nmi_count) {
> @@ -1195,10 +1194,10 @@ trace_hwlat_raw(struct trace_iterator *iter, int flags,
>  
>   trace_assign_type(field, iter->ent);
>  
> - trace_seq_printf(s, "%llu %lld %ld %09ld %u\n",
> + trace_seq_printf(s, "%llu %lld %lld %09ld %u\n",
>field->duration,
>field->outer_duration,
> -  field->timestamp.tv_sec,
> +  (long long)field->timestamp.tv_sec,
>field->timestamp.tv_nsec,
>field->seqnum);
>  

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/12] trace: Make trace_hwlat timestamp y2038 safe

2017-04-07 Thread Deepa Dinamani
struct timespec is not y2038 safe on 32 bit machines
and needs to be replaced by struct timespec64
in order to represent times beyond year 2038 on such
machines.

Fix all the timestamp representation in struct trace_hwlat
and all the corresponding implementations.

Signed-off-by: Deepa Dinamani 
---
 kernel/trace/trace_entries.h |  6 +++---
 kernel/trace/trace_hwlat.c   | 14 +++---
 kernel/trace/trace_output.c  |  9 -
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index c203ac4..adcdbbe 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -348,14 +348,14 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
__field(u64,duration)
__field(u64,outer_duration  )
__field(u64,nmi_total_ts)
-   __field_struct( struct timespec,timestamp   )
-   __field_desc(   long,   timestamp,  tv_sec  )
+   __field_struct( struct timespec64,  timestamp   )
+   __field_desc(   s64,timestamp,  tv_sec  )
__field_desc(   long,   timestamp,  tv_nsec )
__field(unsigned int,   nmi_count   )
__field(unsigned int,   seqnum  )
),
 
-   F_printk("cnt:%u\tts:%010lu.%010lu\tinner:%llu\touter:%llunmi-ts:%llu\tnmi-count:%u\n",
+   F_printk("cnt:%u\tts:%010llu.%010lu\tinner:%llu\touter:%llunmi-ts:%llu\tnmi-count:%u\n",
 __entry->seqnum,
 __entry->tv_sec,
 __entry->tv_nsec,
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 21ea6ae..d7c8e4e 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -79,12 +79,12 @@ static u64 last_tracing_thresh = DEFAULT_LAT_THRESHOLD * NSEC_PER_USEC;
 
 /* Individual latency samples are stored here when detected. */
 struct hwlat_sample {
-   u64 seqnum; /* unique sequence */
-   u64 duration;   /* delta */
-   u64 outer_duration; /* delta (outer loop) */
-   u64 nmi_total_ts;   /* Total time spent in NMIs */
-   struct timespec timestamp;  /* wall time */
-   int nmi_count;  /* # NMIs during this sample */
+   u64 seqnum; /* unique sequence */
+   u64 duration;   /* delta */
+   u64 outer_duration; /* delta (outer loop) */
+   u64 nmi_total_ts;   /* Total time spent in NMIs */
+   struct timespec64   timestamp;  /* wall time */
+   int nmi_count;  /* # NMIs during this sample */
 };
 
 /* keep the global state somewhere. */
@@ -250,7 +250,7 @@ static int get_sample(void)
s.seqnum = hwlat_data.count;
s.duration = sample;
s.outer_duration = outer_sample;
-   s.timestamp = CURRENT_TIME;
+   ktime_get_real_ts64(&s.timestamp);
s.nmi_total_ts = nmi_total_ts;
s.nmi_count = nmi_count;
trace_hwlat_sample(&s);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 02a4aeb..08f9bab 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -4,7 +4,6 @@
  * Copyright (C) 2008 Red Hat Inc, Steven Rostedt 
  *
  */
-
 #include 
 #include 
 #include 
@@ -1161,11 +1160,11 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
 
trace_assign_type(field, entry);
 
-   trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%ld.%09ld",
+   trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%lld.%09ld",
 field->seqnum,
 field->duration,
 field->outer_duration,
-field->timestamp.tv_sec,
+(long long)field->timestamp.tv_sec,
 field->timestamp.tv_nsec);
 
if (field->nmi_count) {
@@ -1195,10 +1194,10 @@ trace_hwlat_raw(struct trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   trace_seq_printf(s, "%llu %lld %ld %09ld %u\n",
+   trace_seq_printf(s, "%llu %lld %lld %09ld %u\n",
 field->duration,
 field->outer_duration,
-field->timestamp.tv_sec,
+(long long)field->timestamp.tv_sec,
 field->timestamp.tv_nsec,
 field->seqnum);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 05/12] fs: ufs: Use ktime_get_real_ts64() for birthtime

2017-04-07 Thread Deepa Dinamani
CURRENT_TIME is not y2038 safe.
Replace it with ktime_get_real_ts64().
Inode time formats are already 64 bit long and
accommodate time64_t.

Signed-off-by: Deepa Dinamani 
---
 fs/ufs/ialloc.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ufs/ialloc.c b/fs/ufs/ialloc.c
index 9774555..d1dd8cc 100644
--- a/fs/ufs/ialloc.c
+++ b/fs/ufs/ialloc.c
@@ -176,6 +176,7 @@ struct inode *ufs_new_inode(struct inode *dir, umode_t mode)
struct ufs_cg_private_info * ucpi;
struct ufs_cylinder_group * ucg;
struct inode * inode;
+   struct timespec64 ts;
unsigned cg, bit, i, j, start;
struct ufs_inode_info *ufsi;
int err = -ENOSPC;
@@ -323,8 +324,9 @@ struct inode *ufs_new_inode(struct inode *dir, umode_t mode)
lock_buffer(bh);
ufs2_inode = (struct ufs2_inode *)bh->b_data;
ufs2_inode += ufs_inotofsbo(inode->i_ino);
-   ufs2_inode->ui_birthtime = cpu_to_fs64(sb, CURRENT_TIME.tv_sec);
-   ufs2_inode->ui_birthnsec = cpu_to_fs32(sb, CURRENT_TIME.tv_nsec);
+   ktime_get_real_ts64(&ts);
+   ufs2_inode->ui_birthtime = cpu_to_fs64(sb, ts.tv_sec);
+   ufs2_inode->ui_birthnsec = cpu_to_fs32(sb, ts.tv_nsec);
mark_buffer_dirty(bh);
unlock_buffer(bh);
if (sb->s_flags & MS_SYNCHRONOUS)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/12] fs: f2fs: Use ktime_get_real_seconds for sit_info times

2017-04-07 Thread Deepa Dinamani
CURRENT_TIME_SEC is not y2038 safe.

Replace use of CURRENT_TIME_SEC with ktime_get_real_seconds
in segment timestamps used by GC algorithm including the
segment mtime timestamps.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
---
 fs/f2fs/segment.c | 2 +-
 fs/f2fs/segment.h | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 010324c..0531500 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -2678,7 +2678,7 @@ static int build_sit_info(struct f2fs_sb_info *sbi)
sit_i->dirty_sentries = 0;
sit_i->sents_per_block = SIT_ENTRY_PER_BLOCK;
sit_i->elapsed_time = le64_to_cpu(sbi->ckpt->elapsed_time);
-   sit_i->mounted_time = CURRENT_TIME_SEC.tv_sec;
+   sit_i->mounted_time = ktime_get_real_seconds();
mutex_init(&sit_i->sentry_lock);
return 0;
 }
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index 57e36c1..156afc3 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -692,8 +692,9 @@ static inline void set_to_next_sit(struct sit_info *sit_i, unsigned int start)
 static inline unsigned long long get_mtime(struct f2fs_sb_info *sbi)
 {
struct sit_info *sit_i = SIT_I(sbi);
-   return sit_i->elapsed_time + CURRENT_TIME_SEC.tv_sec -
-   sit_i->mounted_time;
+   time64_t now = ktime_get_real_seconds();
+
+   return sit_i->elapsed_time + now - sit_i->mounted_time;
 }
 
 static inline void set_summary(struct f2fs_summary *sum, nid_t nid,
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/12] lustre: Replace CURRENT_TIME macro

2017-04-07 Thread Deepa Dinamani
CURRENT_TIME macro is not y2038 safe on 32 bit systems.

The patch replaces all the uses of CURRENT_TIME by
current_time() for filesystem times, and ktime_get_*
functions for others.

struct timespec is also not y2038 safe.
Retain timespec for timestamp representation here as lustre
uses it internally everywhere.
These references will be changed to use struct timespec64
in a separate patch.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. current_time() is also planned to be transitioned
to y2038 safe behavior along with this change.

CURRENT_TIME macro will be deleted before merging the
aforementioned change.

Signed-off-by: Deepa Dinamani 
---
 drivers/staging/lustre/lustre/llite/llite_lib.c | 6 +++---
 drivers/staging/lustre/lustre/osc/osc_io.c  | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index 7b80040..2b4b6b9 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -1472,17 +1472,17 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr, bool hsm_import)
 
/* We mark all of the fields "set" so MDS/OST does not re-set them */
if (attr->ia_valid & ATTR_CTIME) {
-   attr->ia_ctime = CURRENT_TIME;
+   attr->ia_ctime = current_time(inode);
attr->ia_valid |= ATTR_CTIME_SET;
}
if (!(attr->ia_valid & ATTR_ATIME_SET) &&
(attr->ia_valid & ATTR_ATIME)) {
-   attr->ia_atime = CURRENT_TIME;
+   attr->ia_atime = current_time(inode);
attr->ia_valid |= ATTR_ATIME_SET;
}
if (!(attr->ia_valid & ATTR_MTIME_SET) &&
(attr->ia_valid & ATTR_MTIME)) {
-   attr->ia_mtime = CURRENT_TIME;
+   attr->ia_mtime = current_time(inode);
attr->ia_valid |= ATTR_MTIME_SET;
}
 
diff --git a/drivers/staging/lustre/lustre/osc/osc_io.c b/drivers/staging/lustre/lustre/osc/osc_io.c
index f991bee..cbab800 100644
--- a/drivers/staging/lustre/lustre/osc/osc_io.c
+++ b/drivers/staging/lustre/lustre/osc/osc_io.c
@@ -216,7 +216,7 @@ static int osc_io_submit(const struct lu_env *env,
struct cl_object *obj = ios->cis_obj;
 
cl_object_attr_lock(obj);
-   attr->cat_mtime = LTIME_S(CURRENT_TIME);
+   attr->cat_mtime = ktime_get_real_seconds();
attr->cat_ctime = attr->cat_mtime;
cl_object_attr_update(env, obj, attr, CAT_MTIME | CAT_CTIME);
cl_object_attr_unlock(obj);
@@ -256,7 +256,7 @@ static void osc_page_touch_at(const struct lu_env *env,
   kms > loi->loi_kms ? "" : "not ", loi->loi_kms, kms,
   loi->loi_lvb.lvb_size);
 
-   attr->cat_ctime = LTIME_S(CURRENT_TIME);
+   attr->cat_ctime = ktime_get_real_seconds();
attr->cat_mtime = attr->cat_ctime;
valid = CAT_MTIME | CAT_CTIME;
if (kms > loi->loi_kms) {
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/12] time: Delete CURRENT_TIME_SEC and CURRENT_TIME

2017-04-07 Thread Deepa Dinamani
All uses of CURRENT_TIME_SEC and CURRENT_TIME macros have
been replaced by other time functions. These macros are
also not y2038 safe.
And, all their use cases can be fulfilled by y2038 safe
ktime_get_* variants.

Signed-off-by: Deepa Dinamani 
Acked-by: John Stultz 
Reviewed-by: Arnd Bergmann 
---
 include/linux/time.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index 23f0f5c..c0543f5 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -151,9 +151,6 @@ static inline bool timespec_inject_offset_valid(const struct timespec *ts)
return true;
 }
 
-#define CURRENT_TIME   (current_kernel_time())
-#define CURRENT_TIME_SEC   ((struct timespec) { get_seconds(), 0 })
-
 /* Some architectures do not supply their own clocksource.
  * This is mainly the case in architectures that get their
  * inter-tick times by reading the counter on their interval
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/12] apparmorfs: Replace CURRENT_TIME with current_time()

2017-04-07 Thread Deepa Dinamani
CURRENT_TIME macro is not y2038 safe on 32 bit systems.

The patch replaces all the uses of CURRENT_TIME by
current_time().

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. current_time() is also planned to be transitioned
to y2038 safe behavior along with this change.

CURRENT_TIME macro will be deleted before merging the
aforementioned change.

Signed-off-by: Deepa Dinamani 
---
 security/apparmor/apparmorfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index be0b498..4f6ac9d 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -1357,7 +1357,7 @@ static int aa_mk_null_file(struct dentry *parent)
 
inode->i_ino = get_next_ino();
inode->i_mode = S_IFCHR | S_IRUGO | S_IWUGO;
-   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+   inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
init_special_inode(inode, S_IFCHR | S_IRUGO | S_IWUGO,
   MKDEV(MEM_MAJOR, 3));
d_instantiate(dentry, inode);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/12] fs: btrfs: Use ktime_get_real_ts for root ctime

2017-04-07 Thread Deepa Dinamani
btrfs_root_item maintains the ctime for root updates.
This is not part of vfs_inode.

Since current_time() uses struct inode* as an argument
as Linus suggested, this cannot be used to update root
times unless, we modify the signature to use inode.

Since btrfs uses nanosecond time granularity, it can also
use ktime_get_real_ts directly to obtain timestamp for
the root. It is necessary to use the timespec time api
here because the same btrfs_set_stack_timespec_*() apis
are used for vfs inode times as well. These can be
transitioned to using timespec64 when btrfs internally
changes to use timespec64 as well.

Signed-off-by: Deepa Dinamani 
Acked-by: David Sterba 
Reviewed-by: Arnd Bergmann 
---
 fs/btrfs/root-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index a08224e..7d6bc30 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -501,8 +501,9 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans,
 struct btrfs_root *root)
 {
struct btrfs_root_item *item = &root->root_item;
-   struct timespec ct = current_fs_time(root->fs_info->sb);
+   struct timespec ct;
 
+   ktime_get_real_ts(&ct);
spin_lock(&root->root_item_lock);
btrfs_set_root_ctransid(item, trans->transid);
btrfs_set_stack_timespec_sec(&item->ctime, ct.tv_sec);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/12] fs: ubifs: Replace CURRENT_TIME_SEC with current_time

2017-04-07 Thread Deepa Dinamani
CURRENT_TIME_SEC is not y2038 safe. current_time() will
be transitioned to use 64 bit time along with vfs in a
separate patch.
There is no plan to transition CURRENT_TIME_SEC to use
y2038 safe time interfaces.

current_time() returns timestamps according to the
granularities set in the inode's super_block.
The granularity check to call current_fs_time() or
CURRENT_TIME_SEC is not required.

Use current_time() directly to update inode timestamp.
Use timespec_trunc during file system creation, before
the first inode is created.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
---
 fs/ubifs/dir.c   | 12 ++--
 fs/ubifs/file.c  | 12 ++--
 fs/ubifs/ioctl.c |  2 +-
 fs/ubifs/misc.h  | 10 --
 fs/ubifs/sb.c| 14 ++
 fs/ubifs/xattr.c |  6 +++---
 6 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 30825d88..8510d79 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -121,7 +121,7 @@ struct inode *ubifs_new_inode(struct ubifs_info *c, struct inode *dir,
 
inode_init_owner(inode, dir, mode);
inode->i_mtime = inode->i_atime = inode->i_ctime =
-ubifs_current_time(inode);
+current_time(inode);
inode->i_mapping->nrpages = 0;
 
switch (mode & S_IFMT) {
@@ -750,7 +750,7 @@ static int ubifs_link(struct dentry *old_dentry, struct inode *dir,
lock_2_inodes(dir, inode);
inc_nlink(inode);
ihold(inode);
-   inode->i_ctime = ubifs_current_time(inode);
+   inode->i_ctime = current_time(inode);
dir->i_size += sz_change;
dir_ui->ui_size = dir->i_size;
dir->i_mtime = dir->i_ctime = inode->i_ctime;
@@ -823,7 +823,7 @@ static int ubifs_unlink(struct inode *dir, struct dentry *dentry)
}
 
lock_2_inodes(dir, inode);
-   inode->i_ctime = ubifs_current_time(dir);
+   inode->i_ctime = current_time(dir);
drop_nlink(inode);
dir->i_size -= sz_change;
dir_ui->ui_size = dir->i_size;
@@ -927,7 +927,7 @@ static int ubifs_rmdir(struct inode *dir, struct dentry *dentry)
}
 
lock_2_inodes(dir, inode);
-   inode->i_ctime = ubifs_current_time(dir);
+   inode->i_ctime = current_time(dir);
clear_nlink(inode);
drop_nlink(dir);
dir->i_size -= sz_change;
@@ -1405,7 +1405,7 @@ static int do_rename(struct inode *old_dir, struct dentry *old_dentry,
 * Like most other Unix systems, set the @i_ctime for inodes on a
 * rename.
 */
-   time = ubifs_current_time(old_dir);
+   time = current_time(old_dir);
old_inode->i_ctime = time;
 
/* We must adjust parent link count when renaming directories */
@@ -1578,7 +1578,7 @@ static int ubifs_xrename(struct inode *old_dir, struct dentry *old_dentry,
 
lock_4_inodes(old_dir, new_dir, NULL, NULL);
 
-   time = ubifs_current_time(old_dir);
+   time = current_time(old_dir);
fst_inode->i_ctime = time;
snd_inode->i_ctime = time;
old_dir->i_mtime = old_dir->i_ctime = time;
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index d9ae86f..2cda3d6 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1196,7 +1196,7 @@ static int do_truncation(struct ubifs_info *c, struct inode *inode,
mutex_lock(&ui->ui_mutex);
ui->ui_size = inode->i_size;
/* Truncation changes inode [mc]time */
-   inode->i_mtime = inode->i_ctime = ubifs_current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
/* Other attributes may be changed at the same time as well */
do_attr_changes(inode, attr);
err = ubifs_jnl_truncate(c, inode, old_size, new_size);
@@ -1243,7 +1243,7 @@ static int do_setattr(struct ubifs_info *c, struct inode *inode,
mutex_lock(&ui->ui_mutex);
if (attr->ia_valid & ATTR_SIZE) {
/* Truncation changes inode [mc]time */
-   inode->i_mtime = inode->i_ctime = ubifs_current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
/* 'truncate_setsize()' changed @i_size, update @ui_size */
ui->ui_size = inode->i_size;
}
@@ -1420,7 +1420,7 @@ int ubifs_update_time(struct inode *inode, struct timespec *time,
  */
 static int update_mctime(struct inode *inode)
 {
-   struct timespec now = ubifs_current_time(inode);
+   struct timespec now = current_time(inode);
struct ubifs_inode *ui = ubifs_inode(inode);
struct ubifs_info *c = inode->i_sb->s_fs_info;
 
@@ -1434,7 +1434,7 @@ static int update_mctime(struct inode *inode)
return err;
 
mutex_lock(&ui->ui_mutex);
-   inode->i_mtime = inode->i_ctime = ubifs_current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
release = ui->dirty;
  

[PATCH 12/12] time: Delete current_fs_time() function

2017-04-07 Thread Deepa Dinamani
All uses of the current_fs_time() function have been
replaced by other time interfaces.

And, its use cases can be fulfilled by current_time()
or ktime_get_* variants.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
---
 include/linux/fs.h |  1 -
 kernel/time/time.c | 14 --
 2 files changed, 15 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f1d7347..cce6c57 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1430,7 +1430,6 @@ static inline void i_gid_write(struct inode *inode, gid_t 
gid)
inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
 }
 
-extern struct timespec current_fs_time(struct super_block *sb);
 extern struct timespec current_time(struct inode *inode);
 
 /*
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 25bdd25..cf69cca 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -230,20 +230,6 @@ SYSCALL_DEFINE1(adjtimex, struct timex __user *, txc_p)
return copy_to_user(txc_p, &txc, sizeof(struct timex)) ? -EFAULT : ret;
 }
 
-/**
- * current_fs_time - Return FS time
- * @sb: Superblock.
- *
- * Return the current time truncated to the time granularity supported by
- * the fs.
- */
-struct timespec current_fs_time(struct super_block *sb)
-{
-   struct timespec now = current_kernel_time();
-   return timespec_trunc(now, sb->s_time_gran);
-}
-EXPORT_SYMBOL(current_fs_time);
-
 /*
  * Convert jiffies to milliseconds and back.
  *
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/12] fs: cifs: Replace CURRENT_TIME by other appropriate apis

2017-04-07 Thread Deepa Dinamani
CURRENT_TIME macro is not y2038 safe on 32 bit systems.

The patch replaces all the uses of CURRENT_TIME by
current_time() for filesystem times, and ktime_get_*
functions for authentication timestamps and timezone
calculations.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe.

CURRENT_TIME macro will be deleted before merging the
aforementioned change.

The inode timestamps read from the server are assumed
to have correct granularity and range.

The patch also assumes that the difference between server and
client times lie in the range INT_MIN..INT_MAX. This is valid
because this is the difference between current times between
server and client, and the largest timezone difference is in the
range of one day.

All cifs timestamps currently use timespec representation internally.
Authentication and timezone timestamps can also be transitioned into
using timespec64 when all other timestamps for cifs is transitioned
to use timespec64.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
---
 fs/cifs/cifsencrypt.c |  4 +++-
 fs/cifs/cifssmb.c | 10 +-
 fs/cifs/inode.c   | 28 +++-
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/cifs/cifsencrypt.c b/fs/cifs/cifsencrypt.c
index 058ac9b..68abbb0 100644
--- a/fs/cifs/cifsencrypt.c
+++ b/fs/cifs/cifsencrypt.c
@@ -478,6 +478,7 @@ find_timestamp(struct cifs_ses *ses)
unsigned char *blobptr;
unsigned char *blobend;
struct ntlmssp2_name *attrptr;
+   struct timespec ts;
 
if (!ses->auth_key.len || !ses->auth_key.response)
return 0;
@@ -502,7 +503,8 @@ find_timestamp(struct cifs_ses *ses)
blobptr += attrsize; /* advance attr value */
}
 
-   return cpu_to_le64(cifs_UnixTimeToNT(CURRENT_TIME));
+   ktime_get_real_ts(&ts);
+   return cpu_to_le64(cifs_UnixTimeToNT(ts));
 }
 
 static int calc_ntlmv2_hash(struct cifs_ses *ses, char *ntlmv2_hash,
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 0669506..2f279b7 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -478,14 +478,14 @@ decode_lanman_negprot_rsp(struct TCP_Server_Info *server, NEGOTIATE_RSP *pSMBr)
 * this requirement.
 */
int val, seconds, remain, result;
-   struct timespec ts, utc;
-   utc = CURRENT_TIME;
+   struct timespec ts;
+   unsigned long utc = ktime_get_real_seconds();
ts = cnvrtDosUnixTm(rsp->SrvTime.Date,
rsp->SrvTime.Time, 0);
cifs_dbg(FYI, "SrvTime %d sec since 1970 (utc: %d) diff: %d\n",
-(int)ts.tv_sec, (int)utc.tv_sec,
-(int)(utc.tv_sec - ts.tv_sec));
-   val = (int)(utc.tv_sec - ts.tv_sec);
+(int)ts.tv_sec, (int)utc,
+(int)(utc - ts.tv_sec));
+   val = (int)(utc - ts.tv_sec);
seconds = abs(val);
result = (seconds / MIN_TZ_ADJ) * MIN_TZ_ADJ;
remain = seconds % MIN_TZ_ADJ;
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index b261db3..c3b2fa0 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -322,9 +322,9 @@ cifs_create_dfs_fattr(struct cifs_fattr *fattr, struct super_block *sb)
fattr->cf_mode = S_IFDIR | S_IXUGO | S_IRWXU;
fattr->cf_uid = cifs_sb->mnt_uid;
fattr->cf_gid = cifs_sb->mnt_gid;
-   fattr->cf_atime = CURRENT_TIME;
-   fattr->cf_ctime = CURRENT_TIME;
-   fattr->cf_mtime = CURRENT_TIME;
+   ktime_get_real_ts(&fattr->cf_mtime);
+   fattr->cf_mtime = timespec_trunc(fattr->cf_mtime, sb->s_time_gran);
+   fattr->cf_atime = fattr->cf_ctime = fattr->cf_mtime;
fattr->cf_nlink = 2;
fattr->cf_flags |= CIFS_FATTR_DFS_REFERRAL;
 }
@@ -586,9 +586,10 @@ static int cifs_sfu_mode(struct cifs_fattr *fattr, const unsigned char *path,
 /* Fill a cifs_fattr struct with info from FILE_ALL_INFO */
 static void
 cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,
-  struct cifs_sb_info *cifs_sb, bool adjust_tz,
+  struct super_block *sb, bool adjust_tz,
   bool symlink)
 {
+   struct cifs_sb_info *cifs_sb = CIFS_SB(sb);
struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
 
memset(fattr, 0, sizeof(*fattr));
@@ -598,8 +599,10 @@ cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,
 
if (info->LastAccessTime)
fattr->cf_atime = cifs_NTtimeToUnix(info->LastAccessTime);
-   else
-   fattr->cf_atime = CURRENT_TIME;
+   else {
+   ktime_get_real_ts(&fattr->cf_atime);
+   fattr->cf_atime = timespec_trunc(fattr->cf_atime, sb->s_time_gran);
+   }
 
fattr->cf_ctime = 

[PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-04-07 Thread Deepa Dinamani
CURRENT_TIME is not y2038 safe.
The macro will be deleted and all the references to it
will be replaced by ktime_get_* apis.

struct timespec is also not y2038 safe.
Retain timespec for timestamp representation here as ceph
uses it internally everywhere.
These references will be changed to use struct timespec64
in a separate patch.

The current_fs_time() api is being changed to use vfs
struct inode* as an argument instead of struct super_block*.

Set the new mds client request r_stamp field using
ktime_get_real_ts() instead of using current_fs_time().

Also, since r_stamp is used as mtime on the server, use
timespec_trunc() to truncate the timestamp, using the right
granularity from the superblock.

This api will be transitioned to be y2038 safe along
with vfs.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
---
 drivers/block/rbd.c   | 2 +-
 fs/ceph/mds_client.c  | 4 +++-
 net/ceph/messenger.c  | 6 --
 net/ceph/osd_client.c | 4 ++--
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 517838b..77204da 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
 {
struct ceph_osd_request *osd_req = obj_request->osd_req;
 
-   osd_req->r_mtime = CURRENT_TIME;
+   ktime_get_real_ts(&osd_req->r_mtime);
osd_req->r_data_offset = obj_request->offset;
 }
 
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index c681762..1d3fa90 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1666,6 +1666,7 @@ struct ceph_mds_request *
 ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
 {
struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
+   struct timespec ts;
 
if (!req)
return ERR_PTR(-ENOMEM);
@@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
init_completion(&req->r_safe_completion);
INIT_LIST_HEAD(&req->r_unsafe_item);
 
-   req->r_stamp = current_fs_time(mdsc->fsc->sb);
+   ktime_get_real_ts(&ts);
+   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
 
req->r_op = op;
req->r_direct_mode = mode;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index f76bb33..5766a6c 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1386,8 +1386,9 @@ static void prepare_write_keepalive(struct ceph_connection *con)
dout("prepare_write_keepalive %p\n", con);
con_out_kvec_reset(con);
if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) {
-   struct timespec now = CURRENT_TIME;
+   struct timespec now;
 
+   ktime_get_real_ts(&now);
con_out_kvec_add(con, sizeof(tag_keepalive2), &tag_keepalive2);
ceph_encode_timespec(&con->out_temp_keepalive2, &now);
con_out_kvec_add(con, sizeof(con->out_temp_keepalive2),
@@ -3176,8 +3177,9 @@ bool ceph_con_keepalive_expired(struct ceph_connection *con,
 {
if (interval > 0 &&
(con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2)) {
-   struct timespec now = CURRENT_TIME;
+   struct timespec now;
struct timespec ts;
+   ktime_get_real_ts(&now);
jiffies_to_timespec(interval, &ts);
ts = timespec_add(con->last_keepalive_ack, ts);
return timespec_compare(&now, &ts) >= 0;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index e15ea9e..242d7c0 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -3574,7 +3574,7 @@ ceph_osdc_watch(struct ceph_osd_client *osdc,
ceph_oid_copy(&lreq->t.base_oid, oid);
ceph_oloc_copy(&lreq->t.base_oloc, oloc);
lreq->t.flags = CEPH_OSD_FLAG_WRITE;
-   lreq->mtime = CURRENT_TIME;
+   ktime_get_real_ts(&lreq->mtime);
 
lreq->reg_req = alloc_linger_request(lreq);
if (!lreq->reg_req) {
@@ -3632,7 +3632,7 @@ int ceph_osdc_unwatch(struct ceph_osd_client *osdc,
ceph_oid_copy(&req->r_base_oid, &lreq->t.base_oid);
ceph_oloc_copy(&req->r_base_oloc, &lreq->t.base_oloc);
req->r_flags = CEPH_OSD_FLAG_WRITE;
-   req->r_mtime = CURRENT_TIME;
+   ktime_get_real_ts(&req->r_mtime);
osd_req_op_watch_init(req, 0, lreq->linger_id,
  CEPH_OSD_WATCH_OP_UNWATCH);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/12] audit: Use timespec64 to represent audit timestamps

2017-04-07 Thread Deepa Dinamani
struct timespec is not y2038 safe.
Audit timestamps are recorded in string format into
an audit buffer for a given context.
These mark the entry timestamps for the syscalls.
Use y2038 safe struct timespec64 to represent the times.
The log strings can handle this transition as strings can
hold up to 1024 characters.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
Acked-by: Paul Moore 
Acked-by: Richard Guy Briggs 
---
 include/linux/audit.h |  4 ++--
 kernel/audit.c| 10 +-
 kernel/audit.h|  2 +-
 kernel/auditsc.c  |  6 +++---
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 6fdfefc..f830508 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -332,7 +332,7 @@ static inline void audit_ptrace(struct task_struct *t)
/* Private API (for audit.c only) */
 extern unsigned int audit_serial(void);
 extern int auditsc_get_stamp(struct audit_context *ctx,
- struct timespec *t, unsigned int *serial);
+ struct timespec64 *t, unsigned int *serial);
 extern int audit_set_loginuid(kuid_t loginuid);
 
 static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
@@ -511,7 +511,7 @@ static inline void __audit_seccomp(unsigned long syscall, long signr, int code)
 static inline void audit_seccomp(unsigned long syscall, long signr, int code)
 { }
 static inline int auditsc_get_stamp(struct audit_context *ctx,
- struct timespec *t, unsigned int *serial)
+ struct timespec64 *t, unsigned int *serial)
 {
return 0;
 }
diff --git a/kernel/audit.c b/kernel/audit.c
index 2f4964c..fcbf377 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1625,10 +1625,10 @@ unsigned int audit_serial(void)
 }
 
 static inline void audit_get_stamp(struct audit_context *ctx,
-  struct timespec *t, unsigned int *serial)
+  struct timespec64 *t, unsigned int *serial)
 {
if (!ctx || !auditsc_get_stamp(ctx, t, serial)) {
-   *t = CURRENT_TIME;
+   ktime_get_real_ts64(t);
*serial = audit_serial();
}
 }
@@ -1652,7 +1652,7 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
 int type)
 {
struct audit_buffer *ab;
-   struct timespec t;
+   struct timespec64 t;
unsigned int uninitialized_var(serial);
 
if (audit_initialized != AUDIT_INITIALIZED)
@@ -1705,8 +1705,8 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
}
 
audit_get_stamp(ab->ctx, &t, &serial);
-   audit_log_format(ab, "audit(%lu.%03lu:%u): ",
-t.tv_sec, t.tv_nsec/1000000, serial);
+   audit_log_format(ab, "audit(%llu.%03lu:%u): ",
+(unsigned long long)t.tv_sec, t.tv_nsec/1000000, serial);
 
return ab;
 }
diff --git a/kernel/audit.h b/kernel/audit.h
index 0f1cf6d..cdf96f4 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -112,7 +112,7 @@ struct audit_context {
enum audit_statestate, current_state;
unsigned intserial; /* serial number for record */
int major;  /* syscall number */
-   struct timespec ctime;  /* time of syscall entry */
+   struct timespec64   ctime;  /* time of syscall entry */
unsigned long   argv[4];/* syscall arguments */
longreturn_code;/* syscall return code */
u64 prio;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index e59ffc7..a2d9217 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1532,7 +1532,7 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2,
return;
 
context->serial = 0;
-   context->ctime  = CURRENT_TIME;
+   ktime_get_real_ts64(&context->ctime);
context->in_syscall = 1;
context->current_state  = state;
context->ppid   = 0;
@@ -1941,13 +1941,13 @@ EXPORT_SYMBOL_GPL(__audit_inode_child);
 /**
  * auditsc_get_stamp - get local copies of audit_context values
  * @ctx: audit_context for the task
- * @t: timespec to store time recorded in the audit_context
+ * @t: timespec64 to store time recorded in the audit_context
  * @serial: serial value that is recorded in the audit_context
  *
  * Also sets the context as auditable.
  */
 int auditsc_get_stamp(struct audit_context *ctx,
-  struct timespec *t, unsigned int *serial)
+  struct timespec64 *t, unsigned int *serial)
 {
if (!ctx->in_syscall)
return 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a 

[PATCH 00/12] Delete CURRENT_TIME, CURRENT_TIME_SEC and current_fs_time

2017-04-07 Thread Deepa Dinamani
The series contains the last unmerged uses of CURRENT_TIME,
CURRENT_TIME_SEC, and current_fs_time().
The series also deletes these apis.

All the patches except [PATCH 9/12] and [PATCH 10/12] are resend patches.
These patches fix new instances of CURRENT_TIME.
cifs and ceph patches have been squashed so that we have one patch per
filesystem.

We want to get these merged onto 4.12 release so that I can post the series
that changes vfs timestamps to use 64 bits for 4.13 release.

I'm proposing these to be merged through Andrew's tree.

Filesystem maintainers, please let Andrew know if you will be picking up
the patch in your trees.

Let me know if anybody has other preferences for merging.

Deepa Dinamani (12):
  fs: f2fs: Use ktime_get_real_seconds for sit_info times
  trace: Make trace_hwlat timestamp y2038 safe
  fs: cifs: Replace CURRENT_TIME by other appropriate apis
  fs: ceph: CURRENT_TIME with ktime_get_real_ts()
  fs: ufs: Use ktime_get_real_ts64() for birthtime
  audit: Use timespec64 to represent audit timestamps
  fs: btrfs: Use ktime_get_real_ts for root ctime
  fs: ubifs: Replace CURRENT_TIME_SEC with current_time
  lustre: Replace CURRENT_TIME macro
  apparmorfs: Replace CURRENT_TIME with current_time()
  time: Delete CURRENT_TIME_SEC and CURRENT_TIME
  time: Delete current_fs_time() function

 drivers/block/rbd.c |  2 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c |  6 +++---
 drivers/staging/lustre/lustre/osc/osc_io.c  |  4 ++--
 fs/btrfs/root-tree.c|  3 ++-
 fs/ceph/mds_client.c|  4 +++-
 fs/cifs/cifsencrypt.c   |  4 +++-
 fs/cifs/cifssmb.c   | 10 -
 fs/cifs/inode.c | 28 +
 fs/f2fs/segment.c   |  2 +-
 fs/f2fs/segment.h   |  5 +++--
 fs/ubifs/dir.c  | 12 +--
 fs/ubifs/file.c | 12 +--
 fs/ubifs/ioctl.c|  2 +-
 fs/ubifs/misc.h | 10 -
 fs/ubifs/sb.c   | 14 +
 fs/ubifs/xattr.c|  6 +++---
 fs/ufs/ialloc.c |  6 --
 include/linux/audit.h   |  4 ++--
 include/linux/fs.h  |  1 -
 include/linux/time.h|  3 ---
 kernel/audit.c  | 10 -
 kernel/audit.h  |  2 +-
 kernel/auditsc.c|  6 +++---
 kernel/time/time.c  | 14 -
 kernel/trace/trace_entries.h|  6 +++---
 kernel/trace/trace_hwlat.c  | 14 ++---
 kernel/trace/trace_output.c |  9 
 net/ceph/messenger.c|  6 --
 net/ceph/osd_client.c   |  4 ++--
 security/apparmor/apparmorfs.c  |  2 +-
 30 files changed, 100 insertions(+), 111 deletions(-)

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-07 Thread Peter Grandi
[ ... ]
>>> I've got a mostly inactive btrfs filesystem inside a virtual
>>> machine somewhere that shows interesting behaviour: while no
>>> interesting disk activity is going on, btrfs keeps
>>> allocating new chunks, a GiB at a time.
[ ... ]
> Because the allocator keeps walking forward every file that is
> created and then removed leaves a blank spot behind.

That is a typical "log-structured" filesystem behaviour; not
really surprising that Btrfs does something like that, being
COW. NILFS2 works like that and it requires a compactor (which
does the equivalent of 'balance' and 'defrag'). It is all about
tradeoffs.

With Btrfs I figured out that fairly frequent 'balance' is
really quite important, even with low percent values like
"usage=50", and usually even 'usage=90' does not take a long
time (while the default takes often a long time, I suspect
needlessly).

>> From the exact moment I did mount -o remount,nossd on this
>> filesystem, the problem vanished.

Haha. Indeed. So it switches from "COW" to more like "log
structured" with the 'ssd' option. F2FS can switch like that
too, with some tunables IIRC. Except that modern flash SSDs
already do the "log structured" bit internally, so doing it in
Btrfs does not really help that much.

>> And even I saw some early prototypes inside the codes to
>> allow btrfs do allocation smaller extent than required.
>> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)

I am surprised that this is not already there, but it is a
terrible fix to a big mistake. The big mistake, that nearly all
filesystem designers do, is to assume that contiguous allocation
must be done by writing contiguous large blocks or extents.

This big mistake was behind the stupid idea of the BSD FFS to
raise the block size from 512B to 4096B plus 512B "tails", and
endless stupid proposals to raise page and block sizes that get
done all the time, and is behind the stupid idea of doing
"delayed allocation", so large extents can be written in one go.

The ancient and tried and obvious idea is to preallocate space
ahead of it being written, so that a file physical size may be
larger than its logical length, and by how much it depends on
some adaptive logic, or hinting from the application (if the
file size is known in advance it can be used to preallocate the
whole file).
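
For example (a sketch; file name and size are made up), a downloader
that knows the final size up front could hint it to the filesystem with
posix_fallocate(3) before writing a single byte:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* final size known in advance, e.g. from an HTTP Content-Length */
	off_t final_size = (off_t)700 * 1024 * 1024;
	int fd = open("download.img", O_CREAT | O_WRONLY, 0644);
	int err;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* reserve the space before the slow sequential writes begin */
	err = posix_fallocate(fd, 0, final_size);
	if (err) {
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
		close(fd);
		return 1;
	}
	/* ... write the data here ... */
	close(fd);
	return 0;
}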

> [ ... ] So, this is why putting your /var/log, /var/lib/mailman and
> /var/spool on btrfs is a terrible idea. [ ... ]

That is just the old "writing a file slowly" issue, and many if
not most filesystems have this issue:

  http://www.sabi.co.uk/blog/15-one.html?150203#150203

and as that post shows it was already reported for Btrfs here:

  http://kreijack.blogspot.co.uk/2014/06/btrfs-and-systemd-journal.html

> [ ... ] The fun thing is that this might work, but because of
> the pattern we end up with, a large write apparently fails
> (the files downloaded when doing apt-get update by daily cron)
> which causes a new chunk allocation. This is clearly visible
> in the videos. Directly after that, the new chunk gets filled
> with the same pattern, because the extent allocator now
> continues there and next day same thing happens again etc... [
> ... ]

The general problem is that filesystems have a very difficult
job especially on rotating media and cannot avoid large
important degenerate corner cases by using any adaptive logic.

Only predictive logic can avoid them, and since psychic code is
not possible yet, "predictive" means hints from applications and
users, and application developers and users are usually not
going to give them, or give them wrong.

Consider the "slow writing" corner case, common to logging or
downloads, that you mention: the filesystem logic cannot do well
in the general case because it cannot predict how large the
final file will be, or what the rate of writing will be.

However if the applications or users hint the total final size
or at least a suitable allocation size things are going to be
good. But it is already difficult to expect applications to give
absolutely necessary 'fsync's, so explicit file size or access
pattern hints are a bit of an illusion. It is the ancient
'O_PONIES' issue in one of its many forms.

Fortunately it possible and even easy to do much better
*synthetic* hinting than most library and kernels do today:

  http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d
  http://www.sabi.co.uk/blog/anno05-4th.html?051011b#051011b
  http://www.sabi.co.uk/blog/anno05-4th.html?051011#051011
  http://www.sabi.co.uk/blog/anno05-4th.html?051010#051010

But that has not happened because it is no developer's itch to
fix. I was instead partially impressed that recently the
'vm_cluster' implementation was "fixed", after only one or two
decades from being first reported:

  http://sabi.co.uk/blog/anno05-3rd.html?050923#050923
  https://lwn.net/Articles/716296/
  https://lkml.org/lkml/2001/1/30/160

And still the author(s) of the fix don't seem to be persuaded by
many decades of 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-07 Thread Hans van Kranenburg
Ok, I'm going to revive a year old mail thread here with interesting new
info:

On 05/31/2016 03:36 AM, Qu Wenruo wrote:
> 
> 
> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>> Hi,
>>
>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>> somewhere that shows interesting behaviour: while no interesting disk
>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>
>> A picture, telling more than 1000 words:
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>> (when the amount of allocated/unused goes down, I did a btrfs balance)

That picture is still there, for the idea.

> Nice picture.
> Really better than 1000 words.
> 
> AFAIK, the problem may be caused by fragments.

Free space fragmentation is a key thing here indeed.

The major two things involved here are 1) the extent allocator, which
causes the free space fragmentation 2) the extent allocator, which
doesn't handle the fragmentation it just caused really well.

Let's start with the pictures, instead of too many words. The following
two videos are png images of the 4 block groups with highest vaddr.
Every 15 minutes a picture is created, and then they're added together:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

And, with autodefrag enabled, which was the first thing I tried as a change:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-13-autodefrag-ichiban.mp4

So, this is why putting your /var/log, /var/lib/mailman and /var/spool
on btrfs is a terrible idea.

Because the allocator keeps walking forward every file that is created
and then removed leaves a blank spot behind.

Autodefrag makes the situation only a little bit better, changing the
resulting pattern from a sky full of stars into a snowstorm. The result
of taking a few small writes and rewriting them again is that again the
small parts of free space are left behind.

Just a random idea.. for this write pattern, always putting new writes
in the first free available spot at the beginning of the block group
would make a total difference, since the little 4/8KiB parts would be
filled up again all the time, preventing the shotgun blast from
spreading all over.
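
A toy model of that idea (not btrfs code, just the two allocation
policies over a tiny bitmap: a forward-only "tail" cursor versus
first-fit reuse of holes):

#include <stdio.h>

#define NBLOCKS 16

static int used[NBLOCKS];
static int tail;	/* forward-only cursor, as described above */

static int alloc_tail(void)
{
	if (tail >= NBLOCKS)
		return -1;
	used[tail] = 1;
	return tail++;
}

static int alloc_first_fit(void)
{
	for (int i = 0; i < NBLOCKS; i++) {
		if (!used[i]) {
			used[i] = 1;
			return i;
		}
	}
	return -1;
}

int main(void)
{
	/* churn: allocate six blocks at the tail, then free every other one */
	for (int i = 0; i < 6; i++) {
		int b = alloc_tail();
		if (i % 2 == 0)
			used[b] = 0;	/* file removed: hole left behind */
	}
	printf("tail cursor is now at %d\n", tail);			/* prints 6 */
	printf("first-fit would reuse block %d\n", alloc_first_fit());	/* prints 0 */
	return 0;
}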

> And even I saw some early prototypes inside the codes to allow btrfs do
> allocation smaller extent than required.
> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
> 
> But it's still prototype and seems no one is really working on it now.
> 
> So when btrfs is writing new data, for example, to write about 16M data,
> it will need to allocate a 16M continuous extent, and if it can't find
> large enough space to allocate, then create a new data chunk.
> 
> [...]

That's the cluster idea, right? Combining free space fragments into a
bigger piece of space to fill with writes?

The fun thing is that this might work, but because of the pattern we end
up with, a large write apparently fails (the files downloaded when doing
apt-get update by daily cron) which causes a new chunk allocation. This
is clearly visible in the videos. Directly after that, the new chunk
gets filled with the same pattern, because the extent allocator now
continues there and next day same thing happens again etc...

And voila, there's the answer to my original question.

Now, another surprise:

From the exact moment I did mount -o remount,nossd on this filesystem,
the problem vanished.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png

I don't have a new video yet, but I'll set up a cron tonight and post it
later.

I'm going to send another mail specifically about the nossd/ssd
behaviour and other things I found out last week, but that'll probably
be tomorrow.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix segment fault when doing dio read

2017-04-07 Thread Liu Bo
Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug while iterating bio pages in dio read's endio hook,
and it could end up with a segmentation fault in the dio reading task.

The culprit is the 'if (nr_sectors--)' test: since it post-decrements, the
code assumes that there is one more block in the same page, so the page
offset is increased and the bio which is created to repair the bad block
then has an incorrect bvec.bv_offset, and a later access of the page
content throws a segmentation fault.

This also adds an ASSERT to check the page offset against the page size.
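
A standalone illustration of the off-by-one (a sketch reduced to the
bare decrement semantics):

#include <stdio.h>

int main(void)
{
	int nr_sectors = 1;	/* one (last) block left in this page */

	/* Buggy form: post-decrement tests the value *before* decrementing,
	 * so the branch is still taken and pgoff would advance past the
	 * end of the page. */
	if (nr_sectors--)
		printf("buggy form: takes one extra trip\n");

	nr_sectors = 1;
	nr_sectors--;
	/* Fixed form: decrement first, then test. */
	if (nr_sectors)
		printf("fixed form: never printed\n");
	return 0;
}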

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c875e68..5e71f1e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7972,8 +7972,10 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
 
start += sectorsize;
 
-   if (nr_sectors--) {
+   nr_sectors--;
+   if (nr_sectors) {
pgoff += sectorsize;
+   ASSERT(pgoff < PAGE_SIZE);
goto next_block_or_try_again;
}
}
@@ -8074,8 +8076,10 @@ static int __btrfs_subio_endio_read(struct inode *inode,
 
ASSERT(nr_sectors);
 
-   if (--nr_sectors) {
+   nr_sectors--;
+   if (nr_sectors) {
pgoff += sectorsize;
+   ASSERT(pgoff < PAGE_SIZE);
goto next_block;
}
}
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is btrfs-convert able to deal with sparse files in a ext4 filesystem?

2017-04-07 Thread Kai Herlemann
Hi @all who answered,
thanks for your help and please excuse my late answer. I didn't see
your answers because of a misconfiguration of my GMail filter for that
list.
The filesystem contains backups of some other filesystems (it's on
external storage which is mirrored by RAID 1). So, if the filesystem
gets lost, there are still the source partitions of the backups. But
you're right, I would actually need more space for things like using
btrfs-convert, which needs a backup, but I can't/don't want to afford
that at the moment because I'm a student.

As far as I remember, I used in the past btrfs-convert to convert the
ext4 root partition of my laptop. It didn't really work with old
versions of btrfs-progs (I rollbacked it then or imported a backup),
but it worked with 4.4 or so.
Nevertheless, I won't use btrfs-convert after your warnings, and will
create a new filesystem and copy the data from ext4 to btrfs when I
have enough space.

Thank you all,
Kai
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2 v2] tests: use receive -e to terminate on end marker

2017-04-07 Thread David Sterba
On Mon, Apr 03, 2017 at 10:21:08PM +0200, Christian Brauner wrote:
> Signed-off-by: Christian Brauner 
> ---
>  tests/misc-tests/018-recv-end-of-stream/test.sh | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/tests/misc-tests/018-recv-end-of-stream/test.sh b/tests/misc-tests/018-recv-end-of-stream/test.sh
> index d39683e9..90655929 100755
> --- a/tests/misc-tests/018-recv-end-of-stream/test.sh
> +++ b/tests/misc-tests/018-recv-end-of-stream/test.sh
> @@ -34,7 +34,7 @@ test_full_empty_stream() {
>  
>   run_check $TOP/mkfs.btrfs -f $TEST_DEV
>   run_check_mount_test_dev
> - run_check $SUDO_HELPER $TOP/btrfs receive -v -f "$str" "$TEST_MNT"
> + run_check $SUDO_HELPER $TOP/btrfs receive -e -v -f "$str" "$TEST_MNT"

You're changing an existing test; what I expect is a new testcase, as the
existing tests cover existing usecases and must not break. However, I don't
see right now how to fix it; I will look into it again next week.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Volume appears full but TB's of space available

2017-04-07 Thread Austin S. Hemmelgarn

On 2017-04-07 13:05, John Petrini wrote:

> The use case actually is not Ceph, I was just drawing a comparison
> between Ceph's object replication strategy vs BTRFS's chunk mirroring.
That's actually a really good comparison that I hadn't thought of 
before.  From what I can tell from my limited understanding of how Ceph 
works, the general principals are pretty similar, except that BTRFS 
doesn't understand or implement failure domains (although having CRUSH 
implemented in BTRFS for chunk placement would be a killer feature IMO).


I do find the conversation interesting however as I work with Ceph
quite a lot but have always gone with the default XFS filesystem for
on OSD's.

From a stability perspective, I would normally go with XFS still for 
the OSD's.  Most of the data integrity features provided by BTRFS are 
also implemented in Ceph, so you don't gain much other than flexibility 
currently by using BTRFS instead of XFS.  The one advantage BTRFS has in 
my experience over XFS for something like this is that it seems (with 
recent versions at least) to be more likely to survive a power-failure 
without any serious data loss than XFS is, but that's not really a 
common concern in Ceph's primary use case.



Re: Volume appears full but TB's of space available

2017-04-07 Thread John Petrini
The use case actually is not Ceph, I was just drawing a comparison
between Ceph's object replication strategy vs BTRFS's chunk mirroring.

I do find the conversation interesting however, as I work with Ceph
quite a lot but have always gone with the default XFS filesystem on
the OSDs.


Re: Volume appears full but TB's of space available

2017-04-07 Thread Austin S. Hemmelgarn

On 2017-04-07 12:58, John Petrini wrote:
> When you say "running BTRFS raid1 on top of LVM RAID0 volumes" do you
> mean creating two LVM RAID-0 volumes and then putting BTRFS RAID1 on
> the two resulting logical volumes?
Yes, although it doesn't have to be LVM, it could just as easily be MD
or even hardware RAID (I just prefer LVM for the flexibility it offers).

A quick tip regarding this: it seems you get the best performance if the
stripe size (the -I option for lvcreate) is chosen so that it either
matches the BTRFS block size, or so that each BTRFS block gets striped
across all the disks.
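
For reference, a minimal sketch of that tip; the volume group name,
sizes, and device names below are placeholders, not from this thread:

    # Two striped LVs whose 4 KiB stripe size matches the BTRFS block size.
    lvcreate -i 2 -I 4 -L 100G -n r0a vg /dev/sda /dev/sdb
    lvcreate -i 2 -I 4 -L 100G -n r0b vg /dev/sdc /dev/sdd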




Re: Volume appears full but TB's of space available

2017-04-07 Thread John Petrini
When you say "running BTRFS raid1 on top of LVM RAID0 volumes" do you
mean creating two LVM RAID-0 volumes and then putting BTRFS RAID1 on
the two resulting logical volumes?
___

John Petrini

NOC Systems Administrator   //   CoreDial, LLC   //   coredial.com
//
Hillcrest I, 751 Arbor Way, Suite 150, Blue Bell PA, 19422
P: 215.297.4400 x232   //   F: 215.297.4401   //   E: jpetr...@coredial.com




On Fri, Apr 7, 2017 at 12:51 PM, Austin S. Hemmelgarn
 wrote:
> On 2017-04-07 12:04, Chris Murphy wrote:
>>
>> On Fri, Apr 7, 2017 at 5:41 AM, Austin S. Hemmelgarn
>>  wrote:
>>
>>> I'm rather fond of running BTRFS raid1 on top of LVM RAID0 volumes,
>>> which while it provides no better data safety than BTRFS raid10 mode,
>>> gets
>>> noticeably better performance.
>>
>>
>> This does in fact have better data safety than Btrfs raid10 because it
>> is possible to lose more than one drive without data loss. You can
>> only lose drives on one side of the mirroring, however. This is a
>> conventional raid0+1, so it's not as scalable as raid10 when it comes
>> to rebuild time.
>>
> That's a good point that I don't often remember, and I'm pretty sure that
> such an array will rebuild slower from a single device loss than BTRFS
> raid10 would, but most of that should be that BTRFS is smart enough to only
> rewrite what it has to.
>


Re: Volume appears full but TB's of space available

2017-04-07 Thread Austin S. Hemmelgarn

On 2017-04-07 12:28, Chris Murphy wrote:
> On Fri, Apr 7, 2017 at 7:50 AM, Austin S. Hemmelgarn
>  wrote:
>
>> If you care about both performance and data safety, I would suggest using
>> BTRFS raid1 mode on top of LVM or MD RAID0 together with having good backups
>> and good monitoring.  Statistically speaking, catastrophic hardware failures
>> are rare, and you'll usually have more than enough warning that a device is
>> failing before it actually does, so provided you keep on top of monitoring
>> and replace disks that are showing signs of impending failure as soon as
>> possible, you will be no worse off in terms of data integrity than running
>> ext4 or XFS on top of a LVM or MD RAID10 volume.
>
> Depending on the workload, and what replication is being used by Ceph
> above this storage stack, it might make more sense to do
> something like three lvm/md raid5 arrays, and then Btrfs single data,
> raid1 metadata, across those three raid5s. That's giving up only three
> drives to parity rather than 1/2 the drives, and rebuild time is
> shorter than losing one drive in a raid0 array.
Ah, I had forgotten it was a Ceph back-end system.  In that case, I
would actually suggest essentially the same setup that Chris did,
although I would personally be a bit more conservative and use RAID6
instead of RAID5 for the LVM/MD arrays.  As he said though, it really
depends on what higher-level replication you're doing.  In particular,
if you're running erasure coding instead of replication at the Ceph
level, I would probably still go with BTRFS raid1 on top of LVM/MD RAID0
just to balance out the performance hit from the erasure coding.



Re: Volume appears full but TB's of space available

2017-04-07 Thread Austin S. Hemmelgarn

On 2017-04-07 12:04, Chris Murphy wrote:
> On Fri, Apr 7, 2017 at 5:41 AM, Austin S. Hemmelgarn
>  wrote:
>
>> I'm rather fond of running BTRFS raid1 on top of LVM RAID0 volumes,
>> which while it provides no better data safety than BTRFS raid10 mode, gets
>> noticeably better performance.
>
> This does in fact have better data safety than Btrfs raid10 because it
> is possible to lose more than one drive without data loss. You can
> only lose drives on one side of the mirroring, however. This is a
> conventional raid0+1, so it's not as scalable as raid10 when it comes
> to rebuild time.
That's a good point that I don't often remember, and I'm pretty sure
that such an array will rebuild slower from a single device loss than
BTRFS raid10 would, but most of that should be because BTRFS is smart
enough to only rewrite what it has to.




Re: Volume appears full but TB's of space available

2017-04-07 Thread Chris Murphy
On Fri, Apr 7, 2017 at 7:50 AM, Austin S. Hemmelgarn
 wrote:

> If you care about both performance and data safety, I would suggest using
> BTRFS raid1 mode on top of LVM or MD RAID0 together with having good backups
> and good monitoring.  Statistically speaking, catastrophic hardware failures
> are rare, and you'll usually have more than enough warning that a device is
> failing before it actually does, so provided you keep on top of monitoring
> and replace disks that are showing signs of impending failure as soon as
> possible, you will be no worse off in terms of data integrity than running
> ext4 or XFS on top of a LVM or MD RAID10 volume.


Depending on the workload, and what replication is being used by Ceph
above this storage stack, it might make more sense to do
something like three lvm/md raid5 arrays, and then Btrfs single data,
raid1 metadata, across those three raid5s. That's giving up only three
drives to parity rather than 1/2 the drives, and rebuild time is
shorter than losing one drive in a raid0 array.
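
A hedged sketch of that layout, assuming three md raid5 arrays already
exist as /dev/md0 through /dev/md2:

    # One btrfs across the three raid5s: unreplicated data chunks,
    # mirrored metadata chunks.
    mkfs.btrfs -d single -m raid1 /dev/md0 /dev/md1 /dev/md2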

If this is one ceph host, then it might make sense to split the drives
up so there are two storage bricks using ceph replication between them
for the equivalent of raid1. One brick can do Btrfs on LVM/md raid5,
call it brick A. The other brick can do XFS on LVM/md linear, call it
brick B. The advantage there is the different bricks are going to have
faster commit to stable media times with a mixed workload. The Btrfs
on raid5 brick will do better with sequential reads and writes. The
XFS on linear will do better with metadata heavy reads and writes.
There's probably some Ceph tuning where you can point certain
workloads to particular volumes, where those volumes are backed by
different priorities in the underlying storage. So you'd set up Ceph
volume "mail" to be backed in order by brick B then A.

Not very well known, but XFS will parallelize across drives in a
linear/concat arrangement; it's quite useful for e.g. busy mail
servers.
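
A minimal sketch of such a brick, with device names and the allocation
group count as placeholders:

    # md linear (concat) array, then XFS with several allocation groups
    # spread across the members so metadata-heavy workloads parallelize.
    mdadm --create /dev/md1 --level=linear --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
    mkfs.xfs -d agcount=8 /dev/md1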


-- 
Chris Murphy


Re: linux-next: Tree for Apr 7 (btrfs)

2017-04-07 Thread David Sterba
On Fri, Apr 07, 2017 at 08:10:48AM -0700, Randy Dunlap wrote:
> On 04/07/17 08:08, Randy Dunlap wrote:
> > On 04/07/17 01:27, Stephen Rothwell wrote:
> >> Hi all,
> >>
> >> Changes since 20170406:
> >>
> > 
> > on i386:
> > 
> > ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
> > 
> > Reported-by: Randy Dunlap 
> > 
> 
> or when built-in:
> 
> fs/built-in.o: In function `scrub_bio_end_io_worker':
> scrub.c:(.text+0x3d1908): undefined reference to `__udivdi3'
> fs/built-in.o: In function `scrub_raid56_parity':
> scrub.c:(.text+0x3d23cc): undefined reference to `__udivdi3'
> scrub.c:(.text+0x3d3342): undefined reference to `__udivdi3'
> scrub.c:(.text+0x3d3755): undefined reference to `__udivdi3'

Sorry, I can't reproduce it here; I've also tried my other for-next
snapshot branches, same result. We have some recent patches that could
trigger the __udivdi3 build check,

"Btrfs: update scrub_parity to use u64 stripe_len" (7d0ef8b4dbbd22)

but a manual check, and one with the help of Coccinelle, didn't show me
any instances of 64-bit division with / .


Re: Volume appears full but TB's of space available

2017-04-07 Thread Chris Murphy
On Fri, Apr 7, 2017 at 5:41 AM, Austin S. Hemmelgarn
 wrote:

> I'm rather fond of running BTRFS raid1 on top of LVM RAID0 volumes,
> which while it provides no better data safety than BTRFS raid10 mode, gets
> noticeably better performance.

This does in fact have better data safety than Btrfs raid10 because it
is possible to lose more than one drive without data loss. You can
only lose drives on one side of the mirroring, however. This is a
conventional raid0+1, so it's not as scalable as raid10 when it comes
to rebuild time.



-- 
Chris Murphy


Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-07 Thread Eric Sandeen
On 4/7/17 10:42 AM, Darrick J. Wong wrote:
> On Fri, Apr 07, 2017 at 01:02:58PM +0800, Eryu Guan wrote:
>> On Thu, Apr 06, 2017 at 11:28:01AM -0500, Eric Sandeen wrote:
>>> On 4/6/17 11:26 AM, Theodore Ts'o wrote:
 On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
>
> Test fails with ext3/2 when driving with ext4 driver, fiemap changed
> after umount/mount cycle, then changed back to original result after
> sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)

 I haven't had time to look at this, but I'm not sure this test is a
 reasonable one on the face of it.

 A file system may choose to optimize a file's extent tree for whatever
 reason it wants, whenever it wants, including on an unmount --- and
 that would not be an invalid thing to do.  So to have an xfstests that
 causes a test failure if a file system were to, say, do some cleanup
 at mount or unmount time, or when the file is next opened, to merge
 adjacent extents together (and hence change what is returned by
 FIEMAP) might be strange, or even weird --- but is this any of user
 space's business?  Or anything we want to enforce as wrong wrong wrong
 by xfstests?
>>
>> So I was asking for a review from ext4 side instead of queuing it for
>> next xfstests update :)
> 
> In general FIEMAP can return pretty much whatever it wants, which
> usually means that it won't report extents larger than the underlying
> block mapping extents, though as we've seen it can split a single
> on-disk extent into multiple FIEMAP records for the purpose of reporting
> sharedness.
> 
> For ext3 I'm wondering if it's the case that the first time we FIEMAP an
> indirect map file we see a possibly-merged version of whatever's in the
> particular leaf node we land in; then that information gets cached &
> merged with other records in the extent status tree, such that
> subsequent FIEMAP calls see longer extents than the first time around.
> 
>>> I had the same question.  If the exact behavior isn't defined anywhere,
>>> I don't know what we can be testing, TBH.
>>
>> Agreed, I was about to ask for the expected behavior today if there was
>> no new review comments on this patch.
> 
> I think the expected behavior is that any behavior is expected. :(

I suppose that if any particular filesystem wants to enforce stricter
rules for its own fiemap interface, that could be done in a
filesystem-specific test...

-Eric


Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-07 Thread Darrick J. Wong
On Fri, Apr 07, 2017 at 01:02:58PM +0800, Eryu Guan wrote:
> On Thu, Apr 06, 2017 at 11:28:01AM -0500, Eric Sandeen wrote:
> > On 4/6/17 11:26 AM, Theodore Ts'o wrote:
> > > On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
> > >>
> > >> Test fails with ext3/2 when driving with ext4 driver, fiemap changed
> > >> after umount/mount cycle, then changed back to original result after
> > >> sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)
> > > 
> > > I haven't had time to look at this, but I'm not sure this test is a
> > > reasonable one on the face of it.
> > > 
> > > A file system may choose to optimize a file's extent tree for whatever
> > > reason it wants, whenever it wants, including on an unmount --- and
> > > that would not be an invalid thing to do.  So to have an xfstests that
> > > causes a test failure if a file system were to, say, do some cleanup
> > > at mount or unmount time, or when the file is next opened, to merge
> > > adjacent extents together (and hence change what is returned by
> > > FIEMAP) might be strange, or even weird --- but is this any of user
> > > space's business?  Or anything we want to enforce as wrong wrong wrong
> > > by xfstests?
> 
> So I was asking for a review from ext4 side instead of queuing it for
> next xfstests update :)

In general FIEMAP can return pretty much whatever it wants, which
usually means that it won't report extents larger than the underlying
block mapping extents, though as we've seen it can split a single
on-disk extent into multiple FIEMAP records for the purpose of reporting
sharedness.

For ext3 I'm wondering if it's the case that the first time we FIEMAP an
indirect map file we see a possibly-merged version of whatever's in the
particular leaf node we land in; then that information gets cached &
merged with other records in the extent status tree, such that
subsequent FIEMAP calls see longer extents than the first time around.

> > I had the same question.  If the exact behavior isn't defined anywhere,
> > I don't know what we can be testing, TBH.
> 
> Agreed, I was about to ask for the expected behavior today if there was
> no new review comments on this patch.

I think the expected behavior is that any behavior is expected. :(

--D

> 
> Thanks for the comments and review!
> 
> Eryu


Re: linux-next: Tree for Apr 7 (btrfs)

2017-04-07 Thread Randy Dunlap
On 04/07/17 08:08, Randy Dunlap wrote:
> On 04/07/17 01:27, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20170406:
>>
> 
> on i386:
> 
> ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
> 
> Reported-by: Randy Dunlap 
> 

or when built-in:

fs/built-in.o: In function `scrub_bio_end_io_worker':
scrub.c:(.text+0x3d1908): undefined reference to `__udivdi3'
fs/built-in.o: In function `scrub_raid56_parity':
scrub.c:(.text+0x3d23cc): undefined reference to `__udivdi3'
scrub.c:(.text+0x3d3342): undefined reference to `__udivdi3'
scrub.c:(.text+0x3d3755): undefined reference to `__udivdi3'


-- 
~Randy


Re: [PATCH 7/8] nowait aio: xfs

2017-04-07 Thread Darrick J. Wong
On Fri, Apr 07, 2017 at 06:34:28AM -0500, Goldwyn Rodrigues wrote:
> 
> 
> On 04/06/2017 05:54 PM, Darrick J. Wong wrote:
> > On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> >>> + if (unaligned_io) {
> >>> + /* If we are going to wait for other DIO to finish, bail */
> >>> + if ((iocb->ki_flags & IOCB_NOWAIT) &&
> >>> +  atomic_read(&inode->i_dio_count))
> >>> + return -EAGAIN;
> >>>   inode_dio_wait(inode);
> >>
> >> This checks i_dio_count twice in the nowait case, I think it should be:
> >>
> >>if (iocb->ki_flags & IOCB_NOWAIT) {
> >>        if (atomic_read(&inode->i_dio_count))
> >>return -EAGAIN;
> >>} else {
> >>inode_dio_wait(inode);
> >>}
> >>
> >>>   if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> >>>   if (flags & IOMAP_DIRECT) {
> >>> + /* A reflinked inode will result in CoW alloc */
> >>> + if (flags & IOMAP_NOWAIT) {
> >>> + error = -EAGAIN;
> >>> + goto out_unlock;
> >>> + }
> >>
> >> This is a bit pessimistic - just because the inode has any shared
> >> extents we could still write into unshared ones.  For now I think this
> >> pessimistic check is fine, but the comment should be corrected.
> > 
> > Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
> > is already an existing reservation in the CoW fork then we'll have to
> > CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
> > anything in the CoW fork, then we have to see if there are shared blocks
> > by calling _reflink_trim_around_shared.  That performs a refcountbt
> > lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.
> > 
> > IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
> > cover both write-to-reflinked-file cases.
> > 
> 
> IOMAP_NOWAIT is set only with IOMAP_DIRECT since the nowait feature is
> for direct-IO only. This is checked early on, when we are checking for

Ah, ok.  Disregard what I said about moving it then.

--D

> user-passed flags, and if not, -EINVAL is returned.
> 
> 
> -- 
> Goldwyn


Re: linux-next: Tree for Apr 7 (btrfs)

2017-04-07 Thread Randy Dunlap
On 04/07/17 01:27, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20170406:
> 

on i386:

ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

Reported-by: Randy Dunlap 

-- 
~Randy


During a btrfs balance nearly all quotas of the subvolumes became exceeded

2017-04-07 Thread Markus Baier

Hello btrfs-list,

today a strange behaviour appeared during the btrfs balance process.

I started a btrfs balance operation on the /home subvolume
that contains, as children, all the subvolumes for the home directories
of the users, every subvolume with its own quota.

A short time after the start of the balance process, no user
was able to write into their home directory anymore.
All users got the "your disk quota exceeded" message.

Then I checked the qgroups and got the following result:

btrfs qgroup show -r /home/
qgroupid  rfer      excl      max_rfer
--------  ----      ----      --------
0/5       16.00KiB  16.00KiB  none
0/257     16.00KiB  16.00KiB  none
0/258     16.00EiB  16.00EiB  200.00GiB
0/259     16.00EiB  16.00EiB  200.00GiB
0/260     16.00EiB  16.00EiB  200.00GiB
0/261     16.00EiB  16.00EiB  200.00GiB
0/267     28.00KiB  28.00KiB  200.00GiB

1/1       16.00EiB  16.00EiB  900.00GiB

For most of the subvolumes btrfs calculated 16.00EiB
(I think this is the maximum possible size of the filesystem)
as the amount of used space.
A few subvolumes, all of them nearly empty like 0/267,
were not affected and showed the normal size of 28.00KiB.

I was able to fix the problem with the
btrfs quota rescan /home
command.
But my question is: is this an already known bug, and what can
I do to prevent this problem during the next balance run?
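
Until the root cause is identified, a hedged workaround sketch (the
mount point here is an assumption) is to rescan quotas right after
every balance:

    #!/bin/sh
    # Run the balance, then rescan qgroups and wait (-w) for completion,
    # so any accounting skew the balance introduced gets corrected.
    MNT=/home
    btrfs balance start "$MNT"
    btrfs quota rescan -w "$MNT"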

uname -a
Linux condor-control 4.4.39-gentoo #1 SMP Fri Jan 27 19:16:30 CET 2017
x86_64 Intel Core Processor (Hasswell, no TSX) GenuineIntel GNU/Linux

btrfs --version
btrfs-progs v4.6.1

btrfs fi show
Label: none  uuid: ----
Total devices 1 FS bytes used 3.03GiB
devid 1 size 31.98GiB used 6.12GiB path /dev/vda2

Label: none  uuid: ----
Total devices 1 FS bytes used 106.33GiB
devid 1 size 1024.00GiB used 108.06GiB path /dev/vda2


Thanks,
Markus


Re: Volume appears full but TB's of space available

2017-04-07 Thread Austin S. Hemmelgarn

On 2017-04-07 09:28, John Petrini wrote:
> Hi Austin,
>
> Thanks for taking the time to provide all of this great information!
Glad I could help.

> You've got me curious about RAID1. If I were to convert the array to
> RAID1, could it then sustain a multi-drive failure? Or, in other words,
> do I actually end up with mirrored pairs, or can a chunk still be
> mirrored to any disk in the array? Are there performance implications
> to using RAID1 vs RAID10?

For raid10, your data is stored as 2 replicas striped at or below the 
filesystem-block level across all the disks in the array.  Because of 
how the data striping is done currently, you're functionally guaranteed 
to lose data if you lose more than one disk in raid10 mode.  This 
theoretically could be improved so that partial losses could be 
recovered, but doing so with the current implementation would be 
extremely complicated, and as such is not a high priority (although 
patches would almost certainly be welcome).


For raid1, your data is stored as 2 replicas with each entirely on one 
disk, but individual chunks (the higher level allocation in BTRFS) are 
distributed in a round-robin fashion among the disks, so any given 
filesystem block is on exactly 2 disks.  With the current 
implementation, for any reasonably utilized filesystem, you will lose 
data if you lose 2 or more disks in raid1 mode.  That said, there are 
plans (still currently vaporware in favor of getting raid5/6 working) to 
add arbitrary replication levels to BTRFS, so once that hits, you could 
set things to have as many replicas as you want.
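
Since you asked about converting: the conversion itself is an online
balance operation. A minimal sketch, with the mount point as a
placeholder:

    # Convert both data and metadata chunks to the raid1 profile, then
    # verify the resulting profiles.  Can take a long time on large arrays.
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/array
    btrfs filesystem df /mnt/array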


In effect, both can currently only sustain one disk failure, but losing 
2 disks in raid10 will probably corrupt files (currently, it will 
functionally kill the FS, although with a bit of theoretically simple 
work this could be changed), while losing 2 disks in raid1 mode will 
usually just make files disappear unless they are larger than the data 
chunk size (usually between 1-5GB depending on the size of the FS), so 
if you're just storing small files, you'll have an easier time 
quantifying data loss with raid1 than raid10.  Both modes have the 
possibility of completely losing the FS if the lost disks happen to take 
out the System chunk.


As for performance, raid10 mode in BTRFS gets better performance, but 
you can get even better performance than that by running BTRFS in raid1 
mode on top of 2 LVM or MD raid0 volumes.  Such a configuration provides 
the same effective data safety as BTRFS raid10, but can get anywhere 
from 5-30% better performance depending on the workload.


If you care about both performance and data safety, I would suggest 
using BTRFS raid1 mode on top of LVM or MD RAID0 together with having 
good backups and good monitoring.  Statistically speaking, catastrophic 
hardware failures are rare, and you'll usually have more than enough 
warning that a device is failing before it actually does, so provided 
you keep on top of monitoring and replace disks that are showing signs 
of impending failure as soon as possible, you will be no worse off in 
terms of data integrity than running ext4 or XFS on top of a LVM or MD 
RAID10 volume.



Re: Volume appears full but TB's of space available

2017-04-07 Thread John Petrini
Hi Austin,

Thanks for taking the time to provide all of this great information!

You've got me curious about RAID1. If I were to convert the array to
RAID1, could it then sustain a multi-drive failure? Or, in other words,
do I actually end up with mirrored pairs, or can a chunk still be
mirrored to any disk in the array? Are there performance implications
to using RAID1 vs RAID10?


Re: [PATCH v2 9/9] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges

2017-04-07 Thread David Sterba
On Mon, Mar 13, 2017 at 03:52:16PM +0800, Qu Wenruo wrote:
> + /*
> +  * TODO: To also modify reserved->ranges_reserved to reflect

No new TODOs in the code please.

> +  * the modification.
> +  *
> +  * However as long as we free qgroup reserved according to
> +  * EXTENT_QGROUP_RESERVED, we won't double free.
> +  * So no need to rush.
> +  */
> + ret = clear_record_extent_bits(&BTRFS_I(inode)->io_failure_tree,
> + free_start, free_start + free_len - 1,
> + EXTENT_QGROUP_RESERVED, &changeset);
> + if (ret < 0)
> + goto out;
> + freed += changeset.bytes_changed;
> + }
> + btrfs_qgroup_free_refroot(root->fs_info, root->objectid, freed);
> + ret = freed;
> +out:
> + extent_changeset_release(&changeset);
> + return ret;
> +}


Re: [PATCH v2 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2017-04-07 Thread David Sterba
On Mon, Mar 13, 2017 at 03:52:15PM +0800, Qu Wenruo wrote:
> @@ -3355,12 +3355,14 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
>   struct btrfs_fs_info *fs_info = block_group->fs_info;
>   struct btrfs_root *root = fs_info->tree_root;
>   struct inode *inode = NULL;
> + struct extent_changeset data_reserved;

Size of this structure is 40 bytes, and it's being added to many
functions. This will be noticeable in stack consumption.

-extent-tree.c:btrfs_check_data_free_space  40 static
-extent-tree.c:cache_save_setup 96 static
+extent-tree.c:btrfs_check_data_free_space  48 static
+extent-tree.c:cache_save_setup 136 static
-file.c:__btrfs_buffered_write  192 static
+file.c:__btrfs_buffered_write  232 static
-file.c:btrfs_fallocate 208 static
+file.c:btrfs_fallocate 248 static
-inode-map.c:btrfs_save_ino_cache   112 static
+inode-map.c:btrfs_save_ino_cache   152 static
-inode.c:btrfs_direct_IO   128 static
+inode.c:btrfs_direct_IO   176 static
-inode.c:btrfs_writepage_fixup_worker   88 static
+inode.c:btrfs_writepage_fixup_worker   128 static
-inode.c:btrfs_truncate_block   136 static
+inode.c:btrfs_truncate_block   176 static
-inode.c:btrfs_page_mkwrite 112 static
+inode.c:btrfs_page_mkwrite 152 static
+ioctl.c:cluster_pages_for_defrag   200 static
-ioctl.c:btrfs_defrag_file  312 static
+ioctl.c:btrfs_defrag_file  232 static
-qgroup.c:btrfs_qgroup_reserve_data 136 static
+qgroup.c:btrfs_qgroup_reserve_data 128 static
-relocation.c:prealloc_file_extent_cluster  152 static
+relocation.c:prealloc_file_extent_cluster  192 static

These are generic functions, so this will affect non-qgroup workloads as
well. So there needs to be either a dynamic allocation (which would add
another point of failure), or a rework done in another way.


Re: Volume appears full but TB's of space available

2017-04-07 Thread Austin S. Hemmelgarn

On 2017-04-06 23:25, John Petrini wrote:
> Interesting. That's the first time I'm hearing this. If that's the
> case I feel like it's a stretch to call it RAID10 at all. It sounds a
> lot more like basic replication similar to Ceph, only Ceph understands
> failure domains and therefore can be configured to handle device
> failure (albeit at a higher level).
Yeah, the stacking is a bit odd, and there are some rather annoying
caveats that make most of the names other than raid5/raid6 misleading.
In fact, raid1 mode in BTRFS is more like what most people think of as
RAID10 when run on more than 2 disks than BTRFS raid10 mode is, although
it stripes at a much higher level.

> I do of course keep backups but I chose RAID10 for the mix of
> performance and reliability. It doesn't seem worth losing 50% of
> my usable space for the performance gain alone.
>
> Thank you for letting me know about this. Knowing that, I think I may
> have to reconsider my choice here. I've really been enjoying the
> flexibility of BTRFS, which is why I switched to it in the first place,
> but with experimental RAID5/6 and what you've just told me I'm
> beginning to doubt that it's the right choice.
There are some other options in how you configure it.  Most of the more
useful operational modes actually require stacking BTRFS on top of LVM
or MD.  I'm rather fond of running BTRFS raid1 on top of LVM RAID0
volumes, which while it provides no better data safety than BTRFS raid10
mode, gets noticeably better performance.  You can also reverse that to
get something more like traditional RAID10, but you lose the
self-correcting aspect of BTRFS.
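
A hedged sketch of that stacking, with volume group and device names as
placeholders:

    # Two LVM RAID0 legs...
    lvcreate -i 2 -L 100G -n r0a vg /dev/sda /dev/sdb
    lvcreate -i 2 -L 100G -n r0b vg /dev/sdc /dev/sdd
    # ...then BTRFS raid1 across the two legs, so each leg holds one copy.
    mkfs.btrfs -m raid1 -d raid1 /dev/vg/r0a /dev/vg/r0b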


> What's more concerning is that I haven't found a good way to monitor
> BTRFS. I might be able to accept that the array can only handle a
> single drive failure if I were confident that I could detect it, but so
> far I haven't found a good solution for this.
This I can actually give some advice on.  There are a couple of options,
but the easiest is to find a piece of generic monitoring software that
can check the return code of external programs, and then write some
simple scripts to perform the checks on BTRFS (rough sketches of such
scripts follow the list).  The things you want to keep an eye on are:


1. Output of 'btrfs dev stats'.  If you've got a new enough copy of 
btrfs-progs, you can pass '--check' and the return code will be non-zero 
if any of the error counters isn't zero.  If you've got to use an older 
version, you'll instead have to write a script to parse the output (I 
will comment that this is much easier in a language like Perl or Python 
than it is in bash).  You want to watch for steady increases in error 
counts or sudden large jumps.  Single intermittent errors are worth 
tracking, but they tend to happen more frequently the larger the array is.


2. Results from 'btrfs scrub'.  This is somewhat tricky because scrub is 
either asynchronous or blocks for a _long_ time.  The simplest option 
I've found is to fire off an asynchronous scrub to run during down-time, 
and then schedule recurring checks with 'btrfs scrub status'.  On the 
plus side, 'btrfs scrub status' already returns non-zero if the scrub 
found errors.


3. Watch the filesystem flags.  Some monitoring software can easily do 
this for you (Monit for example can watch for changes in the flags). 
The general idea here is that BTRFS will go read-only if it hits certain 
serious errors, so you can watch for that transition and send a 
notification when it happens.  This is also worth watching since the 
filesystem flags should not change during normal operation of any 
filesystem.


4. Watch SMART status on the drives and run regular self-tests.  Most of 
the time, issues will show up here before they show up in the FS, so by 
watching this, you may have an opportunity to replace devices before the 
filesystem ends up completely broken.


5. If you're feeling really ambitious, watch the kernel logs for errors 
from BTRFS and whatever storage drivers you use.  This is the least 
reliable thing out of this list to automate,  so I'd not suggest just 
doing this by itself.
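
For reference, here are minimal script sketches of checks 1 through 5.
The mount point, device list, and log patterns are placeholders, and
check 1 assumes a btrfs-progs new enough to have '--check':

    #!/bin/sh
    MNT=/mnt/array

    # 1. Device error counters: exits non-zero if any counter is non-zero.
    btrfs device stats --check "$MNT" > /dev/null ||
        echo "btrfs: non-zero error counters on $MNT" >&2

    # 2. Scrub: start asynchronously during down-time...
    btrfs scrub start "$MNT"
    # ...and from a later recurring check; non-zero exit means errors found.
    btrfs scrub status "$MNT" ||
        echo "btrfs: scrub found errors on $MNT" >&2

    # 3. Filesystem flags: warn if any btrfs mount has flipped read-only.
    awk '$3 == "btrfs" && $4 ~ /(^|,)ro(,|$)/ { print $2 }' /proc/mounts |
        grep . && echo "btrfs mount(s) above went read-only" >&2

    # 4. SMART health (assumes smartmontools and ATA drives whose health
    # line reports PASSED); self-tests would be scheduled separately.
    for dev in /dev/sda /dev/sdb; do
        smartctl -H "$dev" | grep -q PASSED ||
            echo "SMART health check failed on $dev" >&2
    done

    # 5. Kernel log scan; the pattern is a rough guess and needs tuning
    # for your kernel and storage drivers.
    dmesg | grep -iE 'btrfs.*(error|corrupt|fail)' &&
        echo "possible btrfs kernel errors, see dmesg" >&2

In practice you would split these into separate checks with their own
scheduling; the above just shows one plausible shape for each.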


The first two items are BTRFS specific.  The rest however, are standard 
things you should be monitoring regardless of what type of storage stack 
you have.  Of these, item 3 will immediately trigger in the event of a 
catastrophic device failure, while 1, 2, and 5 will provide better 
coverage of slow failures, and 4 will cover both aspects.


As far as what to use to actually track these, that really depends on 
your use case.  For tracking on an individual system basis, I'd suggest 
Monit, it's efficient, easy to configure, provides some degree of error 
resilience, and can actually cover a lot of monitoring tasks beyond 
stuff like this.  If you want some kind of centralized monitoring, I'd 
probably go with Nagios, but that's more because that's the standard for 
that type of thing, not because I've used it myself (I much prefer 
per-system decentralized monitoring, with only the checks that systems 
are online 

Re: [PATCH 7/8] nowait aio: xfs

2017-04-07 Thread Goldwyn Rodrigues


On 04/06/2017 05:54 PM, Darrick J. Wong wrote:
> On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
>>> +   if (unaligned_io) {
>>> +   /* If we are going to wait for other DIO to finish, bail */
>>> +   if ((iocb->ki_flags & IOCB_NOWAIT) &&
> >>> +    atomic_read(&inode->i_dio_count))
>>> +   return -EAGAIN;
>>> inode_dio_wait(inode);
>>
>> This checks i_dio_count twice in the nowait case, I think it should be:
>>
>>  if (iocb->ki_flags & IOCB_NOWAIT) {
> >>          if (atomic_read(&inode->i_dio_count))
>>  return -EAGAIN;
>>  } else {
>>  inode_dio_wait(inode);
>>  }
>>
>>> if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
>>> if (flags & IOMAP_DIRECT) {
>>> +   /* A reflinked inode will result in CoW alloc */
>>> +   if (flags & IOMAP_NOWAIT) {
>>> +   error = -EAGAIN;
>>> +   goto out_unlock;
>>> +   }
>>
>> This is a bit pessimistic - just because the inode has any shared
>> extents we could still write into unshared ones.  For now I think this
>> pessimistic check is fine, but the comment should be corrected.
> 
> Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
> is already an existing reservation in the CoW fork then we'll have to
> CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
> anything in the CoW fork, then we have to see if there are shared blocks
> by calling _reflink_trim_around_shared.  That performs a refcountbt
> lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.
> 
> IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
> cover both write-to-reflinked-file cases.
> 

IOMAP_NOWAIT is set only with IOMAP_DIRECT since the nowait feature is
for direct-IO only. This is checked early on, when we are checking for
user-passed flags, and if not, -EINVAL is returned.


-- 
Goldwyn


Re: [PATCH v2 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option

2017-04-07 Thread David Sterba
On Mon, Mar 13, 2017 at 03:52:10PM +0800, Qu Wenruo wrote:
> [BUG]
> The easiest way to reproduce the bug is:
> --
>  # mkfs.btrfs -f $dev -n 16K
>  # mount $dev $mnt -o inode_cache
>  # btrfs quota enable $mnt
>  # btrfs quota rescan -w $mnt
>  # btrfs qgroup show $mnt
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          32.00KiB     32.00KiB
>                  ^^ Twice the correct value
> --
> 
> And fstests/btrfs qgroup test group can easily detect them with
> inode_cache mount option.
> Although some of them are false alerts since old test cases are using
> fixed golden output.
> While new test cases will use "btrfs check" to detect qgroup mismatch.
> 
> [CAUSE]
> The inode_cache mount option makes commit_fs_roots() call
> btrfs_save_ino_cache() to update fs/subvol trees, and generate new
> delayed refs.
> 
> However, we call btrfs_qgroup_prepare_account_extents() too early, before
> commit_fs_roots().
> This makes the "old_roots" for newly generated extents always NULL.
> For the freeing-extent case, this makes both new_roots and old_roots
> empty, while the correct old_roots should not be empty.
> This causes qgroup numbers to not be decreased correctly.
> 
> [FIX]
> Modify the timing of the btrfs_qgroup_prepare_account_extents() call to
> just before btrfs_qgroup_account_extents(), and add the needed
> delayed_refs handling.
> So qgroup can handle the inode_cache mount option correctly.
> 
> Signed-off-by: Qu Wenruo 

Patch added to 4.12 queue.


Re: [PATCH v3] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-07 Thread Filipe Manana
On Fri, Apr 7, 2017 at 1:28 AM, Qu Wenruo  wrote:
>
>
> At 04/07/2017 12:02 AM, Filipe Manana wrote:
>>
>> On Thu, Apr 6, 2017 at 2:28 AM, Qu Wenruo  wrote:
>>>
>>> Btrfs allows inline file extent if and only if
>>> 1) It's at offset 0
>>> 2) It's smaller than min(max_inline, page_size)
>>>Although we don't specify if the size is before compression or after
>>>compression.
>>>At least according to current behavior, we are only limiting the size
>>>after compression.
>>> 3) It's the only file extent
>>>So that if we append existing inline extent, it should be converted
>>>to regular file extents.
>>>
>>> However several users in btrfs mail list have reported invalid inline
>>> file extent, which only meets the first two condition, but with regular
>>> file extents following.
>>>
>>> The bug has been here for a long, long time, so long that we have
>>> modified the kernel and btrfs-progs to accept such a case, but it's
>>> still not designed behavior, and must be detected and fixed.
>>
>>
>> So after looking at this better, I'm not convinced it's a problem.
>> Other than making btrfs/137 fail when compression is enabled, what
>> problems have you observed?
>
>
> Inline-then-regular file extent layout itself is a problem.

It is when the inline extent represents less than 4096 bytes of data.
I've fixed such problems in the past with the clone ioctl at least.

>
> Just check the behavior of plain btrfs.
>
> Inline extent must be the *only* extent, hole/regular/prealloc extent
> shouldn't co-exist with inline extent.

Yes, if the inline extent represents less than 4096 bytes of data,
which isn't the case this test is exercising.

>
> Such an append should convert the inline extent to a regular one.
>
> Furthermore, this behavior also exposes the confusion around max_inline.
> If we are limiting inline extent size by its compressed size rather than
> its plain size, then we're hiding a new limit: page size.

You haven't answered my question. So let me state it again.

Did you find any problem with this scenario, where an inline extent
represents 4096 bytes of data?

To me it seems that you neither understood nor did any research to see
if this causes a problem, nor to understand the steps that lead to this
scenario:

1) You found btrfs/137 failing because of this, then you just
copy-pasted it into a new test case and removed the use of send/receive.
You didn't investigate what exactly leads to the potential problem, and
you claimed it was probabilistic when it's really fully deterministic
and there's no need to involve snapshots;

2) You claimed there's a problem but you haven't yet said what is the
problem. Does this scenario causes any kernel crash? Does it result in
some sort of corruption? Does it prevent any file operation from
succeeding? So far it doesn't seem it causes any problem from a user's
perspective.

Unless there's evidence this particular case (inline extent containing
4096 bytes of data) causes at least one problem, I don't agree with
inclusion of this test.

thanks




>
> Thanks,
> Qu
>
>
>>
>> btrfs/137 fails due to the incremental send code not being prepared
>> for this case, which does not seem harmful to me because the inline
>> extent represents 4096 bytes of uncompressed data. It would be a
>> problem only if the uncompressed data was less than 4096 bytes.
>>
>> So unless there's evidence that this particular case causes problems
>> somewhere, I don't think it's useful to have this test.
>>
>> As for btrfs/137, I'm sending a fix for the incremental send code.
>>
>> More comments below anyway.
>>
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>> v2:
>>>   All suggested by Eryu Guan
>>>   Use loop_counts instead of runtime to avoid naming confusion.
>>>   Fix whitespace issues
>>>   Use fsync instead of sync
>>>   Use $XFS_IO_PROG instead of calling xfs_io directly
>>>   Always output corruption possibility into seqres.full
>>>
>>> v3:
>>>   All suggested by Filipe Manana
>>>   Fix gramma errors
>>>   Use simpler reproducer
>>>   Add test case to compress group
>>> ---
>>>  tests/btrfs/140 | 111
>>> 
>>>  tests/btrfs/140.out |   2 +
>>>  tests/btrfs/group   |   1 +
>>>  3 files changed, 114 insertions(+)
>>>  create mode 100755 tests/btrfs/140
>>>  create mode 100644 tests/btrfs/140.out
>>>
>>> diff --git a/tests/btrfs/140 b/tests/btrfs/140
>>> new file mode 100755
>>> index 000..183d9cd
>>> --- /dev/null
>>> +++ b/tests/btrfs/140
>>> @@ -0,0 +1,111 @@
>>> +#! /bin/bash
>>> +# FS QA Test 140
>>> +#
>>> +# Check if btrfs will create inline-then-regular file layout.
>>> +#
>>> +# Btrfs only allows an inline file extent if the file is small enough,
>>> +# and any incoming write that enlarges the file beyond max_inline should
>>> +# replace inline extents with regular extents.
>>> +# This is a long standing bug, so fsck won't detect it and kernel will
>>> allow
>>> +# 

Re: [PATCH] btrfs: add missing memset while reading compressed inline extents

2017-04-07 Thread Filipe Manana
On Fri, Apr 7, 2017 at 2:07 AM, Qu Wenruo  wrote:
>
>
> At 04/07/2017 12:07 AM, Filipe Manana wrote:
>>
>> On Wed, Mar 22, 2017 at 2:37 AM, Qu Wenruo 
>> wrote:
>>>
>>>
>>>
>>> At 03/09/2017 10:05 AM, Zygo Blaxell wrote:


 On Wed, Mar 08, 2017 at 10:27:33AM +, Filipe Manana wrote:
>
>
> On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell
>  wrote:
>>
>>
>> From: Zygo Blaxell 
>>
>> This is a story about 4 distinct (and very old) btrfs bugs.
>>
>> Commit c8b978188c ("Btrfs: Add zlib compression support") added
>> three data corruption bugs for inline extents (bugs #1-3).
>>
>> Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
>> fixed bug #1:  uncompressed inline extents followed by a hole and more
>> extents could get non-zero data in the hole as they were read.  The
>> fix
>> was to add a memset in btrfs_get_extent to zero out the hole.
>>
>> Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
>> fixed bug #2:  compressed inline extents which contained non-zero
>> bytes
>> might be replaced with zero bytes in some cases.  This patch removed
>> an
>> unhelpful memset from uncompress_inline, but the case where memset is
>> required was missed.
>>
>> There is also a memset in the decompression code, but this only covers
>> decompressed data that is shorter than the ram_bytes from the extent
>> ref record.  This memset doesn't cover the region between the end of
>> the
>> decompressed data and the end of the page.  It has also moved around a
>> few times over the years, so there's no single patch to refer to.
>>
>> This patch fixes bug #3:  compressed inline extents followed by a hole
>> and more extents could get non-zero data in the hole as they were read
>> (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
>> The fix is the same:  zero out the hole in the compressed case too,
>> by putting a memset back in uncompress_inline, but this time with
>> correct parameters.
>>
>> The last and oldest bug, bug #0, is the cause of the offending inline
>> extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless
>> quirk
>> of behavior somewhere in the btrfs write code.  In a few special
>> cases,
>> an inline extent and hole are allowed to persist where they normally
>> would be combined with later extents in the file.
>>
>> A fast reproducer for bug #0 is presented below.  A few offending
>> extents
>> are also created in the wild during large rsync transfers with the -S
>> flag.  A Linux kernel build (git checkout; make allyesconfig; make
>> -j8)
>> will produce a handful of offending files as well.  Once an offending
>> file is created, it can present different content to userspace each
>> time it is read.
>>
>> Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
>> kernel back to v3.5 has this behavior.  There are fossil records of
>> this
>> bug's effects in commits all the way back to v2.6.32.  I have no
>> reason
>> to believe bug #0 wasn't present at the beginning of btrfs compression
>> support in v2.6.29, but I can't easily test kernels that old to be
>> sure.
>>
>> It is not clear whether bug #0 is worth fixing.  A fix would likely
>> require injecting extra reads into currently write-only paths, and
>> most
>> of the exceptional cases caused by bug #0 are already handled now.
>>
>> Whether we like them or not, bug #0's inline extents followed by holes
>> are part of the btrfs de-facto disk format now, and we need to be able
>> to read them without data corruption or an infoleak.  So enough about
>> bug #0, let's get back to bug #3 (this patch).
>>
>> An example of on-disk structure leading to data corruption:
>>
>> item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
>> inode generation 50 transid 50 size 47424 nbytes 49141
>> block group 0 mode 100644 links 1 uid 0 gid 0
>> rdev 0 flags 0x0(none)
>> item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
>> inode ref index 3 namelen 10 name: DB_File.so
>> item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
>> inline extent data size 1341 ram 4085 compress(zlib)
>> item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
>> extent data disk byte 5367308288 nr 20480
>> extent data offset 0 nr 45056 ram 45056
>> extent compression(zlib)
>
>
>
> So this case is actually different from the 

Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags

2017-04-07 Thread Filipe Manana
On Fri, Apr 7, 2017 at 9:51 AM, Eryu Guan  wrote:
> On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> For example NFS 4.2 supports fallocate but it does not support its
>> KEEP_SIZE flag, so we want to skip tests that use fallocate with that
>> flag on filesystems that don't support it.
>>
>> Suggested-by: Eryu Guan 
>> Signed-off-by: Filipe Manana 
>> ---
>>  common/rc | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/common/rc b/common/rc
>> index e1ab2c6..3d0f089 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -2021,8 +2021,8 @@ _require_xfs_io_command()
>>   "chproj")
>>   testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1`
>>   ;;
>> - "falloc" )
>> - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
>> + "falloc*" )
>> + testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1`
>
> Sorry, I was wrong about this. It would break the subsequent
> $XFS_IO_PROG -c "help $command" | grep ... command if another $param is
> specified.

Yeah I had noticed that because the following won't cause the return anymore:

test -z "$param" && return

> Seems adding $param to the falloc command is the right way, as
> Darrick did for fiemap in his new test.
>
> -   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> +   testio=`$XFS_IO_PROG -F -f -c "falloc $param 0 1m" $testfile 2>&1`

But in that case grepping the help output, at the very end of the
function, will fail for falloc since its help output fails to match
the grep pattern (as highlighted in the thread you pointed to before).
So that grep pattern would have to change as well.
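
In other words, something like this sketch for the falloc case arm (a
hedged illustration, not the final committed version):

    "falloc" )
        # Pass the optional flag(s) through to the command under test, so
        # e.g. _require_xfs_io_command "falloc" "-k" probes KEEP_SIZE support.
        testio=`$XFS_IO_PROG -F -f -c "falloc $param 0 1m" $testfile 2>&1`
        ;;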

>
> Do you mind me updating these three patches accordingly? Or can you send
> out new version if you like?

Sure, feel free to update it as you feel it's the best way.

Thanks!

>
> Thanks! And sorry again!
>
> Eryu
>
> P.S. I'm thinking of converting all the case switches (except the
> default one) in _require_xfs_io_command() to actually run the $command
> with $param, and doing other cleanups, but that won't block this patch
> and I can do it in another patch.
>
>>   ;;
>>   "fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" | "funshare")
>>   testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \
>> --
>> 2.7.0.rc3
>>


Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags

2017-04-07 Thread Eryu Guan
On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> For example NFS 4.2 supports fallocate but it does not support its
> KEEP_SIZE flag, so we want to skip tests that use fallocate with that
> flag on filesystems that don't support it.
> 
> Suggested-by: Eryu Guan 
> Signed-off-by: Filipe Manana 
> ---
>  common/rc | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/common/rc b/common/rc
> index e1ab2c6..3d0f089 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -2021,8 +2021,8 @@ _require_xfs_io_command()
>   "chproj")
>   testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1`
>   ;;
> - "falloc" )
> - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> + "falloc*" )
> + testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1`

Sorry, I was wrong about this. It would break the subsequent
$XFS_IO_PROG -c "help $command" | grep ... command if another $param is
specified. Seems adding $param to the falloc command is the right way,
as Darrick did for fiemap in his new test.

-   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
+   testio=`$XFS_IO_PROG -F -f -c "falloc $param 0 1m" $testfile 2>&1`

Do you mind me updating these three patches accordingly? Or can you send
out new version if you like?

Thanks! And sorry again!

Eryu

P.S. I'm thinking of converting all the case switches (except the
default one) in _require_xfs_io_command() to actually run the $command
with $param, and doing other cleanups, but that won't block this patch
and I can do it in another patch.

>   ;;
>   "fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" | "funshare")
>   testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \
> -- 
> 2.7.0.rc3
> 