Re: [ceph-users] cephfs ceph: fill_inode badness

2015-12-06 Thread Don Waterloo
Kernel driver. One node is on the 4.3 kernel (Ubuntu wily mainline) and one is on the 4.2
kernel (Ubuntu wily stock).

I don't believe inline data is enabled (nothing in ceph.conf, nothing in
fstab).

It's mounted like this:

10.100.10.60,10.100.10.61,10.100.10.62:/ /cephfs ceph
_netdev,noauto,noatime,x-systemd.requires=network-online.target,x-systemd.automount,x-systemd.device-timeout=10,name=admin,secret=XXX
0 2
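
For comparison, the equivalent one-off kernel-client mount would be something along these
lines (the secret-file path is illustrative; mount.ceph also accepts secret= directly, as
in the fstab entry above):

mount -t ceph 10.100.10.60,10.100.10.61,10.100.10.62:/ /cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret,noatime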

I'm not sure what multiple data pools would mean. I have one metadata pool and
one data pool for the cephfs, and then other ceph pools for OpenStack
cinder, plus one I tried w/ a docker registry that didn't work and that I backed out.

~$ ceph osd lspools
0 rbd,1 mypool,4 cinder-volumes,5 docker,12 cephfs_metadata,13 cephfs_data,

In this last case, one node was unable to read that file (.profile), but
the other node that had it mounted could. A reboot of the affected node
restored access to the file. In my previous case, no node was able to read
the affected file, and stat failed on it (whereas here stat did not fail but
read did).

ceph is 0.94.5-0ubuntu0.15.10.1

~$ ceph status
cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
 health HEALTH_OK
 monmap e1: 3 mons at {nubo-1=
10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
election epoch 970, quorum 0,1,2 nubo-1,nubo-2,nubo-3
 mdsmap e537: 1/1/1 up {0=nubo-1=up:active}, 2 up:standby
 osdmap e2266: 6 osds: 6 up, 6 in
  pgmap v99487: 840 pgs, 6 pools, 131 GB data, 101916 objects
265 GB used, 5357 GB / 5622 GB avail
 840 active+clean




On 6 December 2015 at 08:18, Yan, Zheng <uker...@gmail.com> wrote:

> On Sun, Dec 6, 2015 at 7:01 AM, Don Waterloo <don.water...@gmail.com>
> wrote:
> > Thanks for the advice.
> >
> > I dumped the filesystem contents, then deleted the cephfs, deleted the
> > pools, and recreated from scratch.
> >
> > I did not track the specific issue in fuse, sorry. It gave an endpoint
> > disconnected message. I will next time for sure.
> >
> > After the dump and recreate, all was good. Until... I now have a file
> with a
> > slightly different symptom. I can stat it, but not read it:
> >
> > don@nubo-2:~$ cat .profile
> > cat: .profile: Input/output error
> > don@nubo-2:~$ stat .profile
> >   File: ‘.profile’
> >   Size: 675 Blocks: 2  IO Block: 4194304 regular file
> > Device: 0h/0d   Inode: 1099511687525  Links: 1
> > Access: (0644/-rw-r--r--)  Uid: ( 1000/ don)   Gid: ( 1000/ don)
> > Access: 2015-12-04 05:08:35.247603061 +
> > Modify: 2015-12-04 05:08:35.247603061 +
> > Change: 2015-12-04 05:13:29.395252968 +
> >  Birth: -
> > don@nubo-2:~$ sum .profile
> > sum: .profile: Input/output error
> > don@nubo-2:~$ ls -il .profile
> > 1099511687525 -rw-r--r-- 1 don don 675 Dec  4 05:08 .profile
> >
> > Would this be a similar problem? Should I give up on cephfs? It's been
> > working fine for me for some time, but now 2 errors in 4 days make me
> > very nervous.
>
> Which client are you using (fuse or kernel, and which version)? Do you have
> inline data enabled? Do you have multiple data pools?
>
> Regards
> Yan, Zheng
>
>
> >
> >
> > On 4 December 2015 at 08:16, Yan, Zheng <uker...@gmail.com> wrote:
> >>
> >> On Fri, Dec 4, 2015 at 10:39 AM, Don Waterloo <don.water...@gmail.com>
> >> wrote:
> >> > I have a file which is untouchable: ls -i gives an error, stat gives an
> >> > error. It shows ??? for all fields except the name.
> >> >
> >> > How do I clean this up?
> >> >
> >>
> >> The safest way to clean this up is to create a new directory, move the
> >> rest of the files into the new directory, move the old directory
> >> somewhere you don't touch, and replace the old directory with the new one.
> >>
> >>
> >> If you are still uncomfortable with that, you can use 'rados -p metadata
> >> rmomapkey ...' to forcibly remove the corrupted file.
> >>
> >> first, flush the journal
> >> #ceph daemon mds.nubo-2 flush journal
> >>
> >> find the inode number of the directory which contains the corrupted file
> >>
> >> #rados -p metadata listomapkeys <dir inode>.<frag>
> >>
> >> the output should include the name (with suffix _head) of the corrupted file
> >>
> >> #rados -p metadata rmomapkey <dir inode>.<frag> <corrupted filename>_head
> >>
> >> now the file is deleted, but the directory becomes un-deletable. You
> >> can fix the directory by:
> >>
> >> make sure the 'mds verify scatter' config is disabled
> >> #ceph daemon mds.nubo-2 config set mds_verify_scatter 0

Re: [ceph-users] cephfs ceph: fill_inode badness

2015-12-05 Thread Don Waterloo
Thanks for the advice.

I dumped the filesystem contents, then deleted the cephfs, deleted the
pools, and recreated from scratch.

I did not track the specific issue in fuse, sorry. It gave an endpoint
disconnected message. I will next time for sure.

After the dump and recreate, all was good. Until... I now have a file with
a slightly different symptom. I can stat it, but not read it:

don@nubo-2:~$ cat .profile
cat: .profile: Input/output error
don@nubo-2:~$ stat .profile
  File: ‘.profile’
  Size: 675 Blocks: 2  IO Block: 4194304 regular file
Device: 0h/0d   Inode: 1099511687525  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/ don)   Gid: ( 1000/ don)
Access: 2015-12-04 05:08:35.247603061 +
Modify: 2015-12-04 05:08:35.247603061 +
Change: 2015-12-04 05:13:29.395252968 +
 Birth: -
don@nubo-2:~$ sum .profile
sum: .profile: Input/output error
don@nubo-2:~$ ls -il .profile
1099511687525 -rw-r--r-- 1 don don 675 Dec  4 05:08 .profile

Would this be a similar problem? Should I give up on cephfs? It's been
working fine for me for some time, but now 2 errors in 4 days make me very
nervous.


On 4 December 2015 at 08:16, Yan, Zheng <uker...@gmail.com> wrote:

> On Fri, Dec 4, 2015 at 10:39 AM, Don Waterloo <don.water...@gmail.com>
> wrote:
> > I have a file which is untouchable: ls -i gives an error, stat gives an
> > error. It shows ??? for all fields except the name.
> >
> > How do I clean this up?
> >
>
> The safest way to clean this up is to create a new directory, move the
> rest of the files into the new directory, move the old directory somewhere
> you don't touch, and replace the old directory with the new one.
>
>
> If you are still uncomfortable with that, you can use 'rados -p metadata
> rmomapkey ...' to forcibly remove the corrupted file.
>
> first, flush the journal
> #ceph daemon mds.nubo-2 flush journal
>
> find the inode number of the directory which contains the corrupted file
>
> #rados -p metadata listomapkeys <dir inode>.<frag>
>
> the output should include the name (with suffix _head) of the corrupted file
>
> #rados -p metadata rmomapkey <dir inode>.<frag> <corrupted filename>_head
>
> now the file is deleted, but the directory becomes un-deletable. You
> can fix the directory by:
>
> make sure the 'mds verify scatter' config is disabled
> #ceph daemon mds.nubo-2 config set mds_verify_scatter 0
>
> fragment the directory
> #ceph mds tell 0 fragment_dir <directory path relative to the root of the FS> '0/0' 1
>
> create a file in the directory
> #touch <directory>/foo
>
> the above two steps will fix the directory's stat; now you can delete the directory
> #rm -rf <directory>
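
A hedged sketch of what the filled-in sequence above might look like. The dirfrag object
name format (directory inode number in hex plus a fragment suffix, typically 00000000)
and the omap key format (<filename>_head) are assumptions about how CephFS stores
directory entries; the inode number, paths and file name below are purely illustrative,
and the pool is named cephfs_metadata here (as in the lspools output elsewhere in the
thread) rather than the generic 'metadata':

# get the parent directory's inode number in hex (path illustrative)
printf '%x\n' $(stat -c %i /cephfs/home/don)

# flush the MDS journal so the on-disk dirfrag is current
ceph daemon mds.nubo-2 flush journal

# list the directory entries held in the dirfrag object
rados -p cephfs_metadata listomapkeys 10000000abc.00000000

# the listing should contain a key such as '.profile_head'; remove it
rados -p cephfs_metadata rmomapkey 10000000abc.00000000 .profile_head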
>
>
> > I'm on Ubuntu 15.10, running 0.94.5
> > # ceph -v
> > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> >
> > The node that accessed the file then caused a problem with the MDS:
> >
> > root@nubo-1:/home/git/go/src/github.com/gogits/gogs# ceph status
> > cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
> >  health HEALTH_WARN
> > mds0: Client nubo-1 failing to respond to capability release
> >  monmap e1: 3 mons at
> > {nubo-1=
> 10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
> > election epoch 906, quorum 0,1,2 nubo-1,nubo-2,nubo-3
> >  mdsmap e418: 1/1/1 up {0=nubo-2=up:active}, 2 up:standby
> >  osdmap e2081: 6 osds: 6 up, 6 in
> >   pgmap v95696: 560 pgs, 6 pools, 131 GB data, 97784 objects
> > 265 GB used, 5357 GB / 5622 GB avail
> >  560 active+clean
> >
> > Trying a different node, I see the same problem.
> >
> > I'm getting this error dumped to dmesg:
> >
> > [670243.421212] Workqueue: ceph-msgr con_work [libceph]
> > [670243.421213]   e800e516 8810cd68f9d8
> > 817e8c09
> > [670243.421215]    8810cd68fa18
> > 8107b3c6
> > [670243.421217]  8810cd68fa28 ffea 
> > 
> > [670243.421218] Call Trace:
> > [670243.421221]  [] dump_stack+0x45/0x57
> > [670243.421223]  [] warn_slowpath_common+0x86/0xc0
> > [670243.421225]  [] warn_slowpath_null+0x1a/0x20
> > [670243.421229]  [] fill_inode.isra.18+0xc5c/0xc90
> [ceph]
> > [670243.421233]  [] ? inode_init_always+0x107/0x1b0
> > [670243.421236]  [] ? ceph_mount+0x7e0/0x7e0 [ceph]
> > [670243.421241]  [] ceph_fill_trace+0x332/0x910 [ceph]
> > [670243.421248]  [] handle_reply+0x525/0xb70 [ceph]
> > [670243.421255]  [] dispatch+0x3c8/0xbb0 [ceph]
> > [670243.421260]  [] con_work+0x57b/0x1770 [libceph]
> > [670243.421262]  [] ? dequeue_task_fair+0x36b/0x700
> > [670243.421263]  [] ? put_prev_entity+0x31/0x420

[ceph-users] cephfs ceph: fill_inode badness

2015-12-03 Thread Don Waterloo
I have a file which is untouchable: ls -i gives an error, stat gives an
error. It shows ??? for all fields except the name.

How do I clean this up?

I'm on Ubuntu 15.10, running 0.94.5
# ceph -v
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

The node that accessed the file then caused a problem with the MDS:

root@nubo-1:/home/git/go/src/github.com/gogits/gogs# ceph status
cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
 health HEALTH_WARN
mds0: Client nubo-1 failing to respond to capability release
 monmap e1: 3 mons at {nubo-1=
10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
election epoch 906, quorum 0,1,2 nubo-1,nubo-2,nubo-3
 mdsmap e418: 1/1/1 up {0=nubo-2=up:active}, 2 up:standby
 osdmap e2081: 6 osds: 6 up, 6 in
  pgmap v95696: 560 pgs, 6 pools, 131 GB data, 97784 objects
265 GB used, 5357 GB / 5622 GB avail
 560 active+clean

Trying a different node, I see the same problem.

I'm getting this error dumped to dmesg:

[670243.421212] Workqueue: ceph-msgr con_work [libceph]
[670243.421213]   e800e516 8810cd68f9d8
817e8c09
[670243.421215]    8810cd68fa18
8107b3c6
[670243.421217]  8810cd68fa28 ffea 

[670243.421218] Call Trace:
[670243.421221]  [] dump_stack+0x45/0x57
[670243.421223]  [] warn_slowpath_common+0x86/0xc0
[670243.421225]  [] warn_slowpath_null+0x1a/0x20
[670243.421229]  [] fill_inode.isra.18+0xc5c/0xc90 [ceph]
[670243.421233]  [] ? inode_init_always+0x107/0x1b0
[670243.421236]  [] ? ceph_mount+0x7e0/0x7e0 [ceph]
[670243.421241]  [] ceph_fill_trace+0x332/0x910 [ceph]
[670243.421248]  [] handle_reply+0x525/0xb70 [ceph]
[670243.421255]  [] dispatch+0x3c8/0xbb0 [ceph]
[670243.421260]  [] con_work+0x57b/0x1770 [libceph]
[670243.421262]  [] ? dequeue_task_fair+0x36b/0x700
[670243.421263]  [] ? put_prev_entity+0x31/0x420
[670243.421265]  [] ? __switch_to+0x1f9/0x5c0
[670243.421267]  [] process_one_work+0x1aa/0x440
[670243.421269]  [] worker_thread+0x4b/0x4c0
[670243.421271]  [] ? process_one_work+0x440/0x440
[670243.421273]  [] ? process_one_work+0x440/0x440
[670243.421274]  [] kthread+0xd8/0xf0
[670243.421276]  [] ? kthread_create_on_node+0x1f0/0x1f0
[670243.421277]  [] ret_from_fork+0x3f/0x70
[670243.421279]  [] ? kthread_create_on_node+0x1f0/0x1f0
[670243.421280] ---[ end trace 5cded7a882dfd5d1 ]---
[670243.421282] ceph: fill_inode badness 88179e2d9f28 1004e91.fffe

This problem persisted through a reboot, and there is no fsck to help me.

I also tried with ceph-fuse, but it crashes when I access the file.


Re: [ceph-users] cephfs, low performances

2015-12-20 Thread Don Waterloo
On 20 December 2015 at 19:23, Francois Lafont <flafdiv...@free.fr> wrote:

> On 20/12/2015 22:51, Don Waterloo wrote:
>
> > All nodes have 10Gbps to each other
>
> Even the link client node <---> cluster nodes?
>
> > OSD:
> > $ ceph osd tree
> > ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 5.48996 root default
> > -2 0.8 host nubo-1
> >  0 0.8 osd.0 up  1.0  1.0
> > -3 0.8 host nubo-2
> >  1 0.8 osd.1 up  1.0  1.0
> > -4 0.8 host nubo-3
> >  2 0.8 osd.2 up  1.0  1.0
> > -5 0.92999 host nubo-19
> >  3 0.92999 osd.3 up  1.0  1.0
> > -6 0.92999 host nubo-20
> >  4 0.92999 osd.4 up  1.0  1.0
> > -7 0.92999 host nubo-21
> >  5 0.92999 osd.5 up  1.0  1.0
> >
> > Each contains 1 x Samsung 850 Pro 1TB SSD (on sata)
> >
> > Each are Ubuntu 15.10 running 4.3.0-040300-generic kernel.
> > Each are running ceph 0.94.5-0ubuntu0.15.10.1
> >
> > nubo-1/nubo-2/nubo-3 are 2x X5650 @ 2.67GHz w/ 96GB ram.
> > nubo-19/nubo-20/nubo-21 are 2x E5-2699 v3 @ 2.30GHz, w/ 576GB ram.
> >
> > The connections are to the chipset SATA in each case.
> > The fio test to the underlying xfs disk
> > (e.g. cd /var/lib/ceph/osd/ceph-1; fio --randrepeat=1 --ioengine=libaio
> > --direct=1 --gtod_reduce=1 --name=readwrite --filename=rw.data --bs=4k
> > --iodepth=64 --size=5000MB --readwrite=randrw --rwmixread=50)
> > shows ~22K IOPS on each disk.
> >
> > nubo-1/2/3 are also the mons and the MDSs:
> > $ ceph status
> > cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
> >  health HEALTH_OK
> >  monmap e1: 3 mons at {nubo-1=
> >
> 10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
> > election epoch 1104, quorum 0,1,2 nubo-1,nubo-2,nubo-3
> >  mdsmap e621: 1/1/1 up {0=nubo-3=up:active}, 2 up:standby
> >  osdmap e2459: 6 osds: 6 up, 6 in
> >   pgmap v127331: 840 pgs, 6 pools, 144 GB data, 107 kobjects
> > 289 GB used, 5332 GB / 5622 GB avail
> >  840 active+clean
> >   client io 0 B/s rd, 183 kB/s wr, 54 op/s
>
> And you have "replica size == 3" in your cluster, correct?
> Do you have specific mount options or specific options in ceph.conf
> concerning ceph-fuse?
>
> So the hardware configuration of your cluster seems to me much better
> overall than my cluster (config given in my first message), because you have
> 10Gb links (between the client and the cluster I have just 1Gb) and you
> have full-SSD OSDs.
>
> I have tried to put _all_ of cephfs on my SSD: i.e. the pools "cephfsdata"
> _and_ "cephfsmetadata" are on the SSD. The performance is slightly improved,
> because I get ~670 iops now (with the fio command of my first message again),
> but it still seems bad to me.
>
> In fact, I'm curious to have the opinion of cephfs experts on what iops we
> can expect. If anything, maybe ~700 iops is correct for our hardware
> configuration and we are searching for a problem which doesn't exist...


All nodes are interconnected on 10G (actually 8x10G, so 80Gbps, but I have
7 disabled for this test). I have done an 'iperf' w/ TCP and verified I can
achieve ~9Gbps between each pair. I have jumbo frames enabled (so 9000 MTU,
8982 route MTU).

I have replica 2.

My 2 cephfs pools are:

pool 12 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 256 pgp_num 256 last_change 2239 flags
hashpspool stripe_width 0
pool 13 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 256 pgp_num 256 last_change 2243 flags
hashpspool crash_replay_interval 45 stripe_width 0

W/ ceph-fuse, I used the defaults except for adding noatime.

My ceph.conf is:

[global]
fsid = 
mon_initial_members = nubo-2, nubo-3, nubo-1
mon_host = 10.100.10.61,10.100.10.62,10.100.10.60
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
public_network = 10.100.10.0/24
osd op threads = 6
osd disk threads = 6

[mon]
mon clock drift allowed = .600


[ceph-users] cephfs 'lag' / hang

2015-12-18 Thread Don Waterloo
I have 3 systems w/ a cephfs mounted on them, and I am seeing material
'lag'. By 'lag' I mean it hangs for short stretches of time (1s, sometimes
5s), but very non-repeatably.

If I run
time find . -type f -print0 | xargs -0 stat > /dev/null
it might take ~130ms. But it might also take 10s. Once I've run it, it tends
to stay at ~130ms, suggesting the data is now in cache. In the cases where it
hangs, if I remove the stat, it is hanging on the find of a single file. It
might hiccup 1 or 2 times in a find across 10k files.

This lag might affect e.g. 'cwd', writing a file, basically all operations.

Does anyone have any suggestions? It's a very irritating problem. I do not
see errors in dmesg.

The 3 systems w/ the filesystem mounted are running Ubuntu 15.10
w/ 4.3.0-040300-generic kernel. They are running cephfs from the kernel
driver, mounted in /etc/fstab as:

10.100.10.60,10.100.10.61,10.100.10.62:/ /cephfs ceph
_netdev,noauto,noatime,x-systemd.requires=network-online.target,x-systemd.automount,x-systemd.device-timeout=10,name=admin,secret=XXX
0 2

I have 3 MDSs, 1 active, 2 standby. The 3 machines {nubo-1/-2/-3} are also
the mons, and they are the ones that have the cephfs mounted.

They have a 9K MTU between the systems, and I have checked with ping -s ###
-M do that there are no size blackholes... up to 8954 works, and 8955 gives
'would fragment'.

All the storage devices are 1TB Samsung SSDs, and all are on SATA. There is
no material load on the system while this is occurring (a bit of background
fs usage I guess, but it's otherwise idle, just me).

$ ceph status
cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
 health HEALTH_OK
 monmap e1: 3 mons at {nubo-1=
10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
election epoch 1070, quorum 0,1,2 nubo-1,nubo-2,nubo-3
 mdsmap e587: 1/1/1 up {0=nubo-2=up:active}, 2 up:standby
 osdmap e2346: 6 osds: 6 up, 6 in
  pgmap v113350: 840 pgs, 6 pools, 143 GB data, 104 kobjects
288 GB used, 5334 GB / 5622 GB avail
 840 active+clean

I've checked and the network between them is perfect: no loss, ~no latency
(<< 1ms, they are adjacent on an L2 segment), as are all the OSDs [there
are 6 OSDs].

ceph osd tree
ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.48996 root default
-2 0.8 host nubo-1
 0 0.8 osd.0 up  1.0  1.0
-3 0.8 host nubo-2
 1 0.8 osd.1 up  1.0  1.0
-4 0.8 host nubo-3
 2 0.8 osd.2 up  1.0  1.0
-5 0.92999 host nubo-19
 3 0.92999 osd.3 up  1.0  1.0
-6 0.92999 host nubo-20
 4 0.92999 osd.4 up  1.0  1.0
-7 0.92999 host nubo-21
 5 0.92999 osd.5 up  1.0  1.0


Re: [ceph-users] cephfs, low performances

2015-12-18 Thread Don Waterloo
On 17 December 2015 at 21:36, Francois Lafont  wrote:

> Hi,
>
> I have a ceph cluster, currently unused, and I have (to my mind) very low
> performance.
> I'm not an expert in benchmarks; here is an example of a quick bench:
>
> ---
> # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=readwrite --filename=rw.data --bs=4k --iodepth=64 --size=300MB
> --readwrite=randrw --rwmixread=50
> readwrite: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> iodepth=64
> fio-2.1.3
>
>  ...

I am seeing the same sort of issue.
If I run your 'fio' command sequence on my cephfs, I see ~120 iops.
If I run it on one of the underlying OSDs (e.g. in /var... on the mount
point of the xfs), I get ~20k iops.

On the single SSD mount point it completes in ~1s.
On the cephfs, it takes ~17min.

I'm on Ubuntu 15.10 4.3.0-040300-generic kernel.

My 'ceph -w' while this fio is running shows ~550kB/s read/write.
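
As a possible cross-check (not something tried in the thread): a single-threaded 4k
'rados bench' against the data pool would show what raw RADOS sync writes manage,
independent of the cephfs layer. The pool name and runtime below are illustrative:

rados bench -p cephfs_data 30 write -b 4096 -t 1 --no-cleanup
rados bench -p cephfs_data 30 seq -t 1
rados -p cephfs_data cleanup    # remove the benchmark objects afterwards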


Re: [ceph-users] cephfs, low performances

2015-12-18 Thread Don Waterloo
On 18 December 2015 at 15:48, Don Waterloo <don.water...@gmail.com> wrote:

>
>
> On 17 December 2015 at 21:36, Francois Lafont <flafdiv...@free.fr> wrote:
>
>> Hi,
>>
>> I have a ceph cluster, currently unused, and I have (to my mind) very low
>> performance.
>> I'm not an expert in benchmarks; here is an example of a quick bench:
>>
>> ---
>> # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=readwrite --filename=rw.data --bs=4k --iodepth=64 --size=300MB
>> --readwrite=randrw --rwmixread=50
>> readwrite: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
>> iodepth=64
>> fio-2.1.3
>>
>>  ...
>
> I am seeing the same sort of issue.
> If I run your 'fio' command sequence on my cephfs, I see ~120 iops.
> If I run it on one of the underlying OSDs (e.g. in /var... on the mount
> point of the xfs), I get ~20k iops.
>
>
If I run:
rbd -p mypool create speed-test-image --size 1000
rbd -p mypool bench-write speed-test-image

I get

bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
  SEC      OPS   OPS/SEC     BYTES/SEC
    1    79053  79070.82  323874082.50
    2   144340  72178.81  295644410.60
    3   221975  73997.57  303094057.34
elapsed: 10  ops: 262144  ops/sec: 26129.32  bytes/sec: 107025708.32

which is *much* faster than the cephfs.


Re: [ceph-users] cephfs, low performances

2015-12-20 Thread Don Waterloo
On 20 December 2015 at 08:35, Francois Lafont <flafdiv...@free.fr> wrote:

> Hello,
>
> On 18/12/2015 23:26, Don Waterloo wrote:
>
> > rbd -p mypool create speed-test-image --size 1000
> > rbd -p mypool bench-write speed-test-image
> >
> > I get
> >
> > bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
> >   SEC      OPS   OPS/SEC     BYTES/SEC
> >     1    79053  79070.82  323874082.50
> >     2   144340  72178.81  295644410.60
> >     3   221975  73997.57  303094057.34
> > elapsed: 10  ops: 262144  ops/sec: 26129.32  bytes/sec: 107025708.32
> >
> > which is *much* faster than the cephfs.
>
> Me too, I have better performance with rbd (~1400 iops with the fio command
> in my first message instead of ~575 iops with the same fio command and
> cephfs).
>

I did a bit more work on this.

With ceph-fuse, I get ~700 iops.
With the cephfs kernel client, I get ~120 iops.
These were both on the 4.3 kernel.

So I backed off to the 3.16 kernel on the client, and observed the same results.

So ~20K iops w/ rbd, ~120 iops w/ cephfs.


Re: [ceph-users] cephfs, low performances

2015-12-22 Thread Don Waterloo
On 21 December 2015 at 22:07, Yan, Zheng  wrote:

>
> > OK, so i changed fio engine to 'sync' for the comparison of a single
> > underlying osd vs the cephfs.
> >
> > the cephfs w/ sync is ~ 115iops / ~500KB/s.
>
> This is normal because you were doing single thread sync IO. If
> round-trip time for each OSD request is about 10ms (network latency),
> you can only have about 100 IOPS.
>
>
>
Yes... except the RTT is 200us. So that would be 5000 RTTs/s.
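
As a rough sanity check on those numbers (assuming single-threaded sync IO is bounded by
one round trip per operation):

awk 'BEGIN { rtt = 0.000200; iops = 115;
             printf "RTT-bound ceiling:      %d IOPS\n", 1/rtt;
             printf "implied per-op latency: %.1f ms\n", 1000/iops }'

If that arithmetic holds, most of the ~8.7ms per operation is being spent somewhere other
than the wire (MDS round trips, OSD journal commits, and so on), which appears to be the
point being made above.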


Re: [ceph-users] cephfs 'lag' / hang

2015-12-21 Thread Don Waterloo
On 21 December 2015 at 03:23, Yan, Zheng <uker...@gmail.com> wrote:

> On Sat, Dec 19, 2015 at 4:34 AM, Don Waterloo <don.water...@gmail.com>
> wrote:
> > I have 3 systems w/ a cephfs mounted on them.
> > And i am seeing material 'lag'. By 'lag' i mean it hangs for little bits
> of
> > time (1s, sometimes 5s).
> > But very non repeatable.
> >
> > If i run
> > time find . -type f -print0 | xargs -0 stat > /dev/null
> > it might take ~130ms.
> > But, it might take 10s. Once i've done it, it tends to stay @ the ~130ms,
> > suggesting whatever data is now in cache. On the cases it hangs, if i
> remove
> > the stat, its hanging on the find of one file. It might hiccup 1 or 2
> times
> > in the find across 10k files.
> >
>
>
> When an operation hangs, do you see any 'slow request ...' log messages in
> the cluster log? Besides, do you have multiple clients accessing the
> filesystem? Which version of ceph do you use?
>
> Regards
> Yan, Zheng
>
>
There are some 'slow request ...' log messages:

ceph.log.1.gz:2015-12-20 21:48:51.047945 osd.5 10.100.10.124:6801/46249 561
: cluster [WRN] slow request 30.492476 seconds old, received at 2015-12-20
21:48:20.555383: osd_op(client.1294098.1:315704 1056ffe. [write
0~12475] 13.bf7fb0aa snapc 1=[] ondisk+write e2459) currently waiting for
subops from 1

It's ceph 0.94.5-0ubuntu0.15.10.1 on Ubuntu 15.10 w/
kernel 4.3.0-040300-generic.

What does the 'slow request' actually mean?

The file system is mounted on 3 hosts. The others might be doing some minor
access I suppose, but nothing systemic.

I've had smokeping running between all the OSD machines and have 0 loss and
~0 latency at all times. E.g. it's 200us average, +/- 75us.
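
If it helps, the per-operation timings behind a 'slow request' warning can usually be
pulled from the admin socket of the OSD that logged it (osd.5 in the log line above, so
these would run on the host carrying osd.5); the log path assumes the default location:

ceph daemon osd.5 dump_ops_in_flight    # operations currently in progress and what they wait on
ceph daemon osd.5 dump_historic_ops     # recently completed slow operations, with per-step timestamps
grep 'slow request' /var/log/ceph/ceph.log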


Re: [ceph-users] cephfs, low performances

2015-12-21 Thread Don Waterloo
On 20 December 2015 at 22:47, Yan, Zheng  wrote:

> >> ---
> >>
>
>
> fio tests AIO performance in this case. cephfs does not handle AIO
> properly; AIO is actually sync IO. That's why cephfs is so slow in
> this case.
>
> Regards
> Yan, Zheng
>
>
OK, so I changed the fio engine to 'sync' for the comparison of a single
underlying OSD vs the cephfs.

The cephfs w/ sync is ~115 iops / ~500KB/s.
The underlying OSD storage w/ sync is ~6500 iops / 270MB/s.

I also don't think this explains why ceph-fuse is faster (~5x faster, but
still ~100x slower than it should be).

If I get rid of fio and use tried-and-true dd:
time dd if=/dev/zero of=rw.data bs=256k count=1
the underlying OSD storage shows 426MB/s,
and the cephfs gets 694MB/s.

Hmm.

So I guess my 'lag' issue of slow requests is unrelated, and is my real
problem.
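
One aside, not from the thread: a plain dd like the one above goes through the page
cache, so it largely measures how quickly the cache absorbs the writes rather than
committed IO. Something closer to the sync fio case would be (the count value is
illustrative):

time dd if=/dev/zero of=rw.data bs=256k count=4096 conv=fdatasync   # flush once at the end
time dd if=/dev/zero of=rw.data bs=256k count=4096 oflag=dsync      # flush after every write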


[ceph-users] ceph hang on pg list_unfound

2016-05-18 Thread Don Waterloo
I am running 10.2.0-0ubuntu0.16.04.1.
I've run into a problem w/ the cephfs metadata pool. Specifically, I have a
pg w/ an 'unfound' object.

But I can't figure out which, since when I run:
ceph pg 12.94 list_unfound

it hangs (as does ceph pg 12.94 query). I know it's in the cephfs metadata
pool since I run:
ceph pg ls-by-pool cephfs_metadata | egrep "pg_stat|12\\.94"

and it shows it there:
pg_stat: 12.94   objects: 231   mip: 1   degr: 1   misp: 0   unf: 1
bytes: 90   log: 3092   disklog: 3092
state: active+recovering+degraded   state_stamp: 2016-05-18 23:49:15.718772
v: 8957'386130   reported: 9472:367098
up: [1,4]   up_primary: 1   acting: [1,4]   acting_primary: 1
last_scrub: 8935'385144   scrub_stamp: 2016-05-18 10:46:46.123526
last_deep_scrub: 8337'379527   deep_scrub_stamp: 2016-05-14 22:37:05.974367

OK, so what is hanging, and how can I get it to unhang so I can run a
'mark_unfound_lost' on it?

pg 12.94 is on osd.0

ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.48996 root default
-2 0.8 host nubo-1
 0 0.8 osd.0 up  1.0  1.0
-3 0.8 host nubo-2
 1 0.8 osd.1 up  1.0  1.0
-4 0.8 host nubo-3
 2 0.8 osd.2 up  1.0  1.0
-5 0.92999 host nubo-19
 3 0.92999 osd.3 up  1.0  1.0
-6 0.92999 host nubo-20
 4 0.92999 osd.4 up  1.0  1.0
-7 0.92999 host nubo-21
 5 0.92999 osd.5 up  1.0  1.0

I cranked the logging on osd.0. I see a lot of messages, but nothing
interesting.

I've double-checked that all nodes can ping each other. I've run 'xfs_repair'
on the underlying xfs storage to check for issues (there were none).

Can anyone suggest how to get past this hang so I can try to repair this
system?
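
For reference, a sketch of the commands that would normally apply once the pg responds
again; the pg and osd ids are the ones shown above, and whether restarting the acting
primary is acceptable is a judgment call for this cluster:

ceph health detail                       # lists pgs with unfound objects and their counts
ceph pg 12.94 query                      # shows which OSDs were probed for the missing object
ceph pg 12.94 list_unfound               # names the unfound object(s)
systemctl restart ceph-osd@1             # bouncing the acting primary sometimes clears a hung query
ceph pg 12.94 mark_unfound_lost revert   # or 'delete' if no earlier version of the object exists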


[ceph-users] Questions about cache-tier in 12.1

2017-08-10 Thread Don Waterloo
I have a system w/ 7 hosts.
Each host has 1x1TB NVMe and 2x2TB SATA SSDs.

The intent was to use this for OpenStack, having glance stored on the SSDs,
and cinder + nova running on a replicated cache-tier pool on the NVMe in
front of an erasure-coded pool on the SSDs.

The rationale is that, given the copy-on-write, only the working set of the
nova images would be dirty, and thus the NVMe cache would improve the
latency.

Also, the lifespan (TBW) of the NVMe is much higher, and its rated IOPS is
*much* higher (particularly at low queue depths) compared to the SATA. So I
believe this will give the longest life and highest performance for me.

I have installed Ceph 12.1.2 on Ubuntu 16.04.

Before I start: does someone have a different config to suggest w/ this
equipment?

OK, so I started to configure it, but I ran into an (error? warning?):

$ ceph osd erasure-code-profile set ssd k=2 m=1 plugin=jerasure
technique=reed_sol_van crush-device-class=ssd
$ ceph osd crush rule create-replicated nvme default host nvme
$ ceph osd crush rule create-erasure ssd ssd
$ ceph osd pool create ssd-bulk 1200 erasure ssd
$ ceph osd pool create nvme-cache 1200 nvme

$ ceph osd pool set ssd-bulk allow_ec_overwrites true
$ ceph osd lspools
15 nvme-cache,16 ssd-bulk,

$ ceph osd tier add ssd-bulk nvme-cache
pool 'nvme-cache' is now (or already was) a tier of 'ssd-bulk'

$ ceph osd tier remove ssd-bulk nvme-cache
pool 'nvme-cache' is now (or already was) not a tier of 'ssd-bulk'


So what am I doing wrong? I'm following
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
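
For context, the steps the cache-tiering document describes after 'tier add' look roughly
like the following, assuming the tier is kept in place rather than removed; the sizing
values are placeholders, not recommendations:

ceph osd tier cache-mode nvme-cache writeback
ceph osd tier set-overlay ssd-bulk nvme-cache

ceph osd pool set nvme-cache hit_set_type bloom
ceph osd pool set nvme-cache hit_set_count 12
ceph osd pool set nvme-cache hit_set_period 3600
ceph osd pool set nvme-cache target_max_bytes 500000000000
ceph osd pool set nvme-cache cache_target_dirty_ratio 0.4
ceph osd pool set nvme-cache cache_target_full_ratio 0.8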