[ceph-users] new user error

2017-01-05 Thread Alex Evonosky
Hello group--

I have been running Ceph 10.2.3 for a while now without any issues.  This
evening my admin node (which is also an OSD and monitor) crashed.  I
checked my other OSD servers and the data seems to still be there.

Is there an easy way to bring the admin node back into the cluster?  I am
trying to bring this admin node/OSD back up without losing any data...

Thank you!

-Alex


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-05 Thread Christian Balzer

Hello,

On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:

> Hello All,
> 
> I have set up a Ceph cluster based on the 0.94.6 release on 2 servers, each with
> an 80GB Intel S3510 and 2x 3TB 7.2k SATA disks, 16 CPUs and 24GB RAM,
> connected to a 10G switch with a replica count of 2 [I will add 3 more
> servers to the cluster] and 3 separate monitor nodes which are VMs.
> 
I'd go to the latest Hammer point release; the version you are running has a
lethal cache-tier bug, should you decide to try cache tiering.

The 80GB Intel DC S3510 is a) slow and b) rated for only 0.3 DWPD.
You're going to wear those out quickly and, if they are not replaced in time,
lose data.
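
For a rough sense of scale, an illustrative back-of-the-envelope endurance
calculation (the write-rate comparison is assumed, not a figure from the thread):

    80 GB x 0.3 DWPD  ~= 24 GB of writes per day within the rated endurance
    24 GB / 86,400 s  ~= 0.28 MB/s sustained average write rate

so even a modest average journal write rate of a few MB/s would exceed the
drive's rated endurance many times over.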

2 HDDs give you a theoretical speed of something like 300MB/s sustained;
when used as OSDs I'd expect the usual 50-60MB/s per OSD due to
seeks, journal (file system) and leveldb overheads.
Which perfectly matches your results.
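
Working the numbers reported below through roughly (assuming writes are spread
evenly across the 2 servers and the 2 HDD OSDs in each):

    4 GB / 39 s                     ~= 105 MB/s aggregate client throughput
    replica 2 on 2 servers          -> each server writes ~105 MB/s
    ~105 MB/s / 2 HDD OSDs          ~= 50-55 MB/s per OSD

which lines up with the ~104MB/s journal writes and ~110MB/s network rate
observed on each node.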

> rbd_cache is enabled in the configuration, XFS filesystem, LSI 92465-4i RAID
> card with 512MB cache [the SSD is in writeback mode with BBU]
> 
> 
> Before installing Ceph, I tried to check the max throughput of the Intel 3500 80GB
> SSD using a block size of 4M [I read somewhere that Ceph uses 4MB objects] and
> it was giving 220MB/s {dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct}
> 
Irrelevant; sustained sequential writes will be limited by what your OSDs
(HDDs) can sustain.

> *Observation:*
> Now the cluster is up and running, and from the VM I am trying to write a 4GB
> file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct. It takes around 39 seconds to write.
> 
> During this time the SSD journal was showing disk writes of 104MB/s on both
> Ceph servers (dstat sdb) and the compute node a network transfer rate of
> ~110MB/s on its 10G storage interface (dstat -nN eth2).
> 
As I said, sounds about right.

> 
> my questions are:
> 
> 
>- Is this the best throughput Ceph can offer, or can anything in my
>environment be optimised to get more performance? [iperf shows a max
>throughput of 9.8Gbit/s]
>
Not your network.

Watch your nodes with atop and you will note that your HDDs are maxed out.
 
> 
> 
>- I guess the network/SSD is underutilized and can handle more writes;
>how can this be improved to send more data over the network to the SSD?
> 
As jiajia wrote, a cache tier might give you some speed boost.
But with those SSDs I'd advise against it; they are both too small and too low
in endurance.

> 
> 
>- The rbd kernel module wasn't loaded on the compute node; I loaded it manually
>using "modprobe" and later destroyed/re-created the VMs, but this does not give
>any performance boost. So librbd and kernel RBD are equally fast?
> 
Irrelevant and confusing.
Your VMs will use one or the other depending on how they are configured.

> 
> 
>- A Samsung EVO 840 512GB shows a throughput of 500MB/s for 4M writes [dd
>if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct] and for 4KB it was
>equally fast as the Intel S3500 80GB. Does changing my SSD from the Intel
>S3500 100GB to the Samsung 840 500GB make any performance difference here,
>just because the Samsung 840 EVO is faster for 4M writes? Can Ceph utilize
>this extra speed, since the Samsung EVO 840 is faster at 4M writes?
> 
Those SSDs would be an even worse choice for endurance/reliability
reasons, though their larger size offsets that a bit.

Unless you have a VERY good understanding of, and data on, how much your
cluster is going to write, pick at the very least SSDs with 3+ DWPD
endurance like the DC S3610s.
In very lightly loaded cases a DC S3520 with 1 DWPD may be OK, but again, you
need to know what you're doing here.

Christian
> 
> Can somebody help me understand this better.
> 
> Regards,
> Kevin


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Migrate cephfs metadata to SSD in running cluster

2017-01-05 Thread jiajia zhong
2017-01-04 23:52 GMT+08:00 Mike Miller :

> Wido, all,
>
> can you point me to the "recent benchmarks" so I can have a look?
> How do you define "performance"? I would not expect cephFS throughput to
> change, but it is surprising to me that metadata on SSD will have no
> measurable effect on latency.
>
> - mike
>
>
Operations like "ls", "stat" and "find" would become faster; the bottleneck is
the slow OSDs which store the file data.
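
For the step-by-step CRUSH question further down the thread, a minimal sketch
of the second phase (pre-Luminous syntax; it assumes the SSD OSDs are already
grouped under a CRUSH root named "ssd" and that the metadata pool is called
"cephfs_metadata"; adjust the names to your cluster):

    # create a rule that only selects SSD-backed hosts
    ceph osd crush rule create-simple ssd_rule ssd host
    # look up the ruleset id of the new rule
    ceph osd crush rule dump ssd_rule
    # point the (small) metadata pool at that ruleset
    ceph osd pool set cephfs_metadata crush_ruleset <ruleset-id>

Only the metadata objects move, so the resulting data shuffling is limited.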


>
> On 1/3/17 10:49 AM, Wido den Hollander wrote:
>
>>
>> On 3 January 2017 at 2:49, Mike Miller wrote:
>>>
>>>
>>> will metadata on SSD improve latency significantly?
>>>
>>>
>> No, as I said in my previous e-mail, recent benchmarks showed that
>> storing CephFS metadata on SSD does not improve performance.
>>
>> It still might be good to do since it's not that much data thus recovery
>> will go quickly, but don't expect a CephFS performance improvement.
>>
>> Wido
>>
>> Mike
>>>
>>> On 1/2/17 11:50 AM, Wido den Hollander wrote:
>>>

 On 2 January 2017 at 10:33, Shinobu Kinjo wrote:
>
>
> I've never done a migration of cephfs_metadata from spindle disks to
> SSDs, but logically you could achieve this in 2 phases:
>
>  #1 Configure CRUSH rule including spindle disks and ssds
>  #2 Configure CRUSH rule for just pointing to ssds
>   * This would cause massive data shuffling.
>

 Not really, usually the CephFS metadata isn't that much data.

 Recent benchmarks (can't find them now) show that storing CephFS
 metadata on SSD doesn't really improve performance though.

 Wido


>
> On Mon, Jan 2, 2017 at 2:36 PM, Mike Miller 
> wrote:
>
>> Hi,
>>
>> Happy New Year!
>>
>> Can anyone point me to specific walkthrough / howto instructions on how
>> to move the CephFS metadata to SSD in a running cluster?
>>
>> How is CRUSH to be modified, step by step, such that the metadata
>> migrates to SSD?
>>
>> Thanks and regards,
>>
>> Mike


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-05 Thread jiajia zhong
2017-01-06 11:10 GMT+08:00 kevin parrikar :

> Hello All,
>
> I have set up a Ceph cluster based on the 0.94.6 release on 2 servers, each
> with an 80GB Intel S3510 and 2x 3TB 7.2k SATA disks, 16 CPUs and 24GB RAM,
> connected to a 10G switch with a replica count of 2 [I will add 3 more
> servers to the cluster] and 3 separate monitor nodes which are VMs.
>
> rbd_cache is enabled in the configuration, XFS filesystem, LSI 92465-4i RAID
> card with 512MB cache [the SSD is in writeback mode with BBU]
>
>
> Before installing Ceph, I tried to check the max throughput of the Intel 3500 80GB
> SSD using a block size of 4M [I read somewhere that Ceph uses 4MB objects] and
> it was giving 220MB/s {dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct}
>
> *Observation:*
> Now the cluster is up and running, and from the VM I am trying to write a
> 4GB file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct. It takes around 39 seconds to write.
>
> During this time the SSD journal was showing disk writes of 104MB/s on both
> Ceph servers (dstat sdb) and the compute node a network transfer rate of
> ~110MB/s on its 10G storage interface (dstat -nN eth2).
>
>
> my questions are:
>
>
>- Is this the best throughput Ceph can offer, or can anything in my
>environment be optimised to get more performance? [iperf shows a max
>throughput of 9.8Gbit/s]
>
>
>
>- I guess the network/SSD is underutilized and can handle more writes;
>how can this be improved to send more data over the network to the SSD?
>
Cache tiering?
http://docs.ceph.com/docs/hammer/rados/operations/cache-tiering/
Or try bcache in the kernel.
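
As a minimal sketch of what the linked Hammer docs describe (assuming an
existing data pool named "volumes" and an already-created SSD-backed pool
named "ssd-cache"; names and thresholds are illustrative):

    ceph osd tier add volumes ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay volumes ssd-cache
    ceph osd pool set ssd-cache hit_set_type bloom
    ceph osd pool set ssd-cache target_max_bytes 100000000000

Note Christian's caveat elsewhere in the thread: with small, low-endurance
SSDs a cache tier is probably not advisable.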

>
>- The rbd kernel module wasn't loaded on the compute node; I loaded it manually
>using "modprobe" and later destroyed/re-created the VMs, but this does not give
>any performance boost. So librbd and kernel RBD are equally fast?
>
>
>
>- A Samsung EVO 840 512GB shows a throughput of 500MB/s for 4M writes
>[dd if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct] and for 4KB it
>was equally fast as the Intel S3500 80GB. Does changing my SSD from the
>Intel S3500 100GB to the Samsung 840 500GB make any performance difference
>here, just because the Samsung 840 EVO is faster for 4M writes? Can Ceph
>utilize this extra speed, since the Samsung EVO 840 is faster at 4M writes?
>
>
> Can somebody help me understand this better.
>
> Regards,
> Kevin
>
>


[ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-05 Thread kevin parrikar
Hello All,

I have set up a Ceph cluster based on the 0.94.6 release on 2 servers, each with
an 80GB Intel S3510 and 2x 3TB 7.2k SATA disks, 16 CPUs and 24GB RAM,
connected to a 10G switch with a replica count of 2 [I will add 3 more
servers to the cluster] and 3 separate monitor nodes which are VMs.

rbd_cache is enabled in the configuration, XFS filesystem, LSI 92465-4i RAID
card with 512MB cache [the SSD is in writeback mode with BBU]


Before installing Ceph, I tried to check the max throughput of the Intel 3500 80GB
SSD using a block size of 4M [I read somewhere that Ceph uses 4MB objects] and
it was giving 220MB/s {dd if=/dev/zero of=/dev/sdb bs=4M count=1000
oflag=direct}

*Observation:*
Now the cluster is up and running, and from the VM I am trying to write a 4GB
file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M count=1000
oflag=direct. It takes around 39 seconds to write.

During this time the SSD journal was showing disk writes of 104MB/s on both
Ceph servers (dstat sdb) and the compute node a network transfer rate of
~110MB/s on its 10G storage interface (dstat -nN eth2).


my questions are:


   - Is this the best throughput Ceph can offer, or can anything in my
   environment be optimised to get more performance? [iperf shows a max
   throughput of 9.8Gbit/s]



   - I guess the network/SSD is underutilized and can handle more writes;
   how can this be improved to send more data over the network to the SSD?



   - The rbd kernel module wasn't loaded on the compute node; I loaded it
   manually using "modprobe" and later destroyed/re-created the VMs, but this
   does not give any performance boost. So librbd and kernel RBD are equally fast?



   - A Samsung EVO 840 512GB shows a throughput of 500MB/s for 4M writes [dd
   if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct] and for 4KB it was
   equally fast as the Intel S3500 80GB. Does changing my SSD from the Intel
   S3500 100GB to the Samsung 840 500GB make any performance difference here,
   just because the Samsung 840 EVO is faster for 4M writes? Can Ceph utilize
   this extra speed, since the Samsung EVO 840 is faster at 4M writes?


Can somebody help me understand this better.

Regards,
Kevin


Re: [ceph-users] cephfs ata1.00: status: { DRDY }

2017-01-05 Thread Christian Balzer

Hello,

On Thu, 5 Jan 2017 23:02:51 +0100 Oliver Dzombic wrote:


I've never seen hung qemu tasks; slow/hung I/O tasks inside VMs with a
broken/slow cluster I have seen.
That's because mine are all backed by RBD via librbd.

I think your approach with CephFS probably isn't the way forward.
Also, with CephFS you probably want to run the latest and greatest kernel
there is (4.8?).

Is your cluster logging slow request warnings during that time?

> 
> At night, which is when these issues occur primarily (only?), we run the
> scrubs and deep scrubs.
> 
> During this time the HDD utilization of the cold storage peaks at 80-95%.
> 
Never a good thing if they are also expected to do something useful.
Do the HDD OSDs have their journals inline?

> But we have an SSD hot-storage tier in front of this, which buffers
> writes and reads.
>
By that you mean a cache tier in writeback mode?
 
> In our ceph.conf we already have this settings active:
> 
> osd max scrubs = 1
> osd scrub begin hour = 20
> osd scrub end hour = 7
> osd op threads = 16
> osd client op priority = 63
> osd recovery op priority = 1
> osd op thread timeout = 5
> 
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 7
> 
You're missing the most powerful scrub dampener there is:
osd_scrub_sleep = 0.1
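
For reference, a sketch of how that is typically applied (the value is
illustrative):

    # ceph.conf, [osd] section
    osd scrub sleep = 0.1

    # or at runtime, without restarting the OSDs
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'

This makes each OSD pause briefly between scrub chunks, leaving more disk
time for client I/O.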

> 
> 
> All in all I do not think the clients are actually left with too little IO on
> the cold storage (even if it looks like that at first view).
>
I find that one of the best ways to understand and thus manage your
cluster is to run something like collectd with graphite (or grafana or
whatever cranks your tractor).

This, in combination with detailed spot analysis by atop or similar,
should give a very good idea of what is going on.

So in this case, watch cache-tier promotions and flushes, and see if your
clients' I/Os really are covered by the cache, or if during the night your
VMs do log rotates or access other cold data and thus have to go to
the HDD-based OSDs...
 
> And if it is really as simple as too little IO being left for the clients, my
> question would be: how do we avoid it?
> 
> Turning off scrub/deep scrub completely? That should not be needed and
> is also not really advisable.
> 
From where I'm standing, deep-scrub is a luxury bling thing of limited
value when compared to something with integrated live checksums, as in
Bluestore (so we hope) and BTRFS/ZFS.

That said, your cluster NEEDS to be able to survive scrubs or it will be
in even bigger trouble when OSDs/nodes fail.

Christian

> We simply cannot run less than
> 
> osd max scrubs = 1
> 
> So if scrubbing is eating away all the IO, the scrub algorithm is simply too
> aggressive.
> 
> Or, and that is most probable I guess, I have some kind of config mistake.
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] cephfs ata1.00: status: { DRDY }

2017-01-05 Thread Oliver Dzombic
Hi David,

thank you for your suggestion.

At night, which is when these issues occur primarily (only?), we run the
scrubs and deep scrubs.

During this time the HDD utilization of the cold storage peaks at 80-95%.

But we have an SSD hot-storage tier in front of this, which buffers
writes and reads.

In our ceph.conf we already have this settings active:

osd max scrubs = 1
osd scrub begin hour = 20
osd scrub end hour = 7
osd op threads = 16
osd client op priority = 63
osd recovery op priority = 1
osd op thread timeout = 5

osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7


--


All in all I do not think the clients are actually left with too little IO on
the cold storage (even if it looks like that at first view).

And if it is really as simple as too little IO being left for the clients, my
question would be: how do we avoid it?

Turning off scrub/deep scrub completely? That should not be needed and
is also not really advisable.

We simply cannot run less than

osd max scrubs = 1

So if scrubbing is eating away all the IO, the scrub algorithm is simply too
aggressive.

Or, and that is most probable I guess, I have some kind of config mistake.


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 at the Amtsgericht (district court) of Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


Am 05.01.2017 um 22:50 schrieb David Welch:
> Looks like disk i/o is too slow. You can try configuring ceph.conf with
> settings like "osd client op priority"
> 
> http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/
> (which is not loading for me at the moment...)
> 
> On 01/05/2017 04:43 PM, Oliver Dzombic wrote:
>> Hi,
>>
>> Any idea of the root cause of this? Inside a KVM VM running qcow2 on
>> CephFS, dmesg shows:
>>
>> [846193.473396] ata1.00: status: { DRDY }
>> [846196.231058] ata1: soft resetting link
>> [846196.386714] ata1.01: NODEV after polling detection
>> [846196.391048] ata1.00: configured for MWDMA2
>> [846196.391053] ata1.00: retrying FLUSH 0xea Emask 0x4
>> [846196.391671] ata1: EH complete
>> [1019646.935659] UDP: bad checksum. From 122.224.153.109:46252 to
>> 193.24.210.48:161 ulen 49
>> [1107679.421951] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
>> 0x6 frozen
>> [1107679.423407] ata1.00: failed command: FLUSH CACHE EXT
>> [1107679.424871] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>   res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
>> [1107679.427596] ata1.00: status: { DRDY }
>> [1107684.482035] ata1: link is slow to respond, please be patient
>> (ready=0)
>> [1107689.480237] ata1: device not ready (errno=-16), forcing hardreset
>> [1107689.480267] ata1: soft resetting link
>> [1107689.637701] ata1.00: configured for MWDMA2
>> [1107689.637707] ata1.00: retrying FLUSH 0xea Emask 0x4
>> [1107704.638255] ata1.00: qc timeout (cmd 0xea)
>> [1107704.638282] ata1.00: FLUSH failed Emask 0x4
>> [1107709.687013] ata1: link is slow to respond, please be patient
>> (ready=0)
>> [1107710.095069] ata1: soft resetting link
>> [1107710.246403] ata1.01: NODEV after polling detection
>> [1107710.247225] ata1.00: configured for MWDMA2
>> [1107710.247229] ata1.00: retrying FLUSH 0xea Emask 0x4
>> [1107710.248170] ata1: EH complete
>> [1199723.323256] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
>> 0x6 frozen
>> [1199723.324769] ata1.00: failed command: FLUSH CACHE EXT
>> [1199723.326734] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>   res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
>>
>>
>> Hostmachine is running Kernel 4.5.4
>>
>>
>> Hostmachine dmesg:
>>
>>
>> [1235641.055673] INFO: task qemu-kvm:18287 blocked for more than 120
>> seconds.
>> [1235641.056066]   Not tainted 4.5.4ceph-vps-default #1
>> [1235641.056315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [1235641.056583] qemu-kvmD 8812f939bb58 0 18287  1
>> 0x0080
>> [1235641.056587]  8812f939bb58 881034c02b80 881b7044ab80
>> 8812f939c000
>> [1235641.056590]   7fff 881c7ffd7b70
>> 818c1d90
>> [1235641.056592]  8812f939bb70 818c1525 88103fa16d00
>> 8812f939bc18
>> [1235641.056594] Call Trace:
>> [1235641.056603]  [] ? bit_wait+0x50/0x50
>> [1235641.056605]  [] schedule+0x35/0x80
>> [1235641.056609]  [] schedule_timeout+0x231/0x2d0
>> [1235641.056613]  [] ? ktime_get+0x3c/0xb0
>> [1235641.056622]  [] ? bit_wait+0x50/0x50
>> [1235641.056624]  [] io_schedule_timeout+0xa6/0x110
>> [1235641.056626]  [] bit_wait_io+0x1b/0x60
>> [1235641.056627]  [] __wait_on_bit+0x60/0x90
>> [1235641.056632]  [] wait_on_page_bit+0xcb/0xf0
>> [1235641.056636]  [] ?
>> autoremove_wake_function+0x40/0x40
>> [1235641.056638]  []
>> __filemap_fdatawait_range+0xff/0x180
>> [1235641.056641]  [] 

Re: [ceph-users] cephfs ata1.00: status: { DRDY }

2017-01-05 Thread David Welch
Looks like disk i/o is too slow. You can try configuring ceph.conf with 
settings like "osd client op priority"


http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/
(which is not loading for me at the moment...)

On 01/05/2017 04:43 PM, Oliver Dzombic wrote:

Hi,

Any idea of the root cause of this? Inside a KVM VM running qcow2 on
CephFS, dmesg shows:

[846193.473396] ata1.00: status: { DRDY }
[846196.231058] ata1: soft resetting link
[846196.386714] ata1.01: NODEV after polling detection
[846196.391048] ata1.00: configured for MWDMA2
[846196.391053] ata1.00: retrying FLUSH 0xea Emask 0x4
[846196.391671] ata1: EH complete
[1019646.935659] UDP: bad checksum. From 122.224.153.109:46252 to
193.24.210.48:161 ulen 49
[1107679.421951] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
0x6 frozen
[1107679.423407] ata1.00: failed command: FLUSH CACHE EXT
[1107679.424871] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
  res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[1107679.427596] ata1.00: status: { DRDY }
[1107684.482035] ata1: link is slow to respond, please be patient (ready=0)
[1107689.480237] ata1: device not ready (errno=-16), forcing hardreset
[1107689.480267] ata1: soft resetting link
[1107689.637701] ata1.00: configured for MWDMA2
[1107689.637707] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107704.638255] ata1.00: qc timeout (cmd 0xea)
[1107704.638282] ata1.00: FLUSH failed Emask 0x4
[1107709.687013] ata1: link is slow to respond, please be patient (ready=0)
[1107710.095069] ata1: soft resetting link
[1107710.246403] ata1.01: NODEV after polling detection
[1107710.247225] ata1.00: configured for MWDMA2
[1107710.247229] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107710.248170] ata1: EH complete
[1199723.323256] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
0x6 frozen
[1199723.324769] ata1.00: failed command: FLUSH CACHE EXT
[1199723.326734] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
  res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)


Hostmachine is running Kernel 4.5.4


Hostmachine dmesg:


[1235641.055673] INFO: task qemu-kvm:18287 blocked for more than 120
seconds.
[1235641.056066]   Not tainted 4.5.4ceph-vps-default #1
[1235641.056315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1235641.056583] qemu-kvmD 8812f939bb58 0 18287  1
0x0080
[1235641.056587]  8812f939bb58 881034c02b80 881b7044ab80
8812f939c000
[1235641.056590]   7fff 881c7ffd7b70
818c1d90
[1235641.056592]  8812f939bb70 818c1525 88103fa16d00
8812f939bc18
[1235641.056594] Call Trace:
[1235641.056603]  [] ? bit_wait+0x50/0x50
[1235641.056605]  [] schedule+0x35/0x80
[1235641.056609]  [] schedule_timeout+0x231/0x2d0
[1235641.056613]  [] ? ktime_get+0x3c/0xb0
[1235641.056622]  [] ? bit_wait+0x50/0x50
[1235641.056624]  [] io_schedule_timeout+0xa6/0x110
[1235641.056626]  [] bit_wait_io+0x1b/0x60
[1235641.056627]  [] __wait_on_bit+0x60/0x90
[1235641.056632]  [] wait_on_page_bit+0xcb/0xf0
[1235641.056636]  [] ? autoremove_wake_function+0x40/0x40
[1235641.056638]  [] __filemap_fdatawait_range+0xff/0x180
[1235641.056641]  [] ?
__filemap_fdatawrite_range+0xd1/0x100
[1235641.056644]  [] filemap_fdatawait_range+0x14/0x30
[1235641.056646]  []
filemap_write_and_wait_range+0x3f/0x70
[1235641.056649]  [] ceph_fsync+0x69/0x5c0
[1235641.056656]  [] ? do_futex+0xfd/0x530
[1235641.056663]  [] vfs_fsync_range+0x3d/0xb0
[1235641.056668]  [] ?
syscall_trace_enter_phase1+0x139/0x150
[1235641.056670]  [] do_fsync+0x3d/0x70
[1235641.056673]  [] SyS_fdatasync+0x13/0x20
[1235641.056676]  [] entry_SYSCALL_64_fastpath+0x12/0x71


This sometimes happens, on a healthy cluster, running

ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

OSD Servers running Kernel 4.5.5

Sometimes it causes the VM to refuse IO and it has to be restarted; sometimes
not, and it continues.



Any input is appreciated. Thank you!




--
~~
David Welch
DevOps
ARS
http://thinkars.com



[ceph-users] cephfs ata1.00: status: { DRDY }

2017-01-05 Thread Oliver Dzombic
Hi,

Any idea of the root cause of this? Inside a KVM VM running qcow2 on
CephFS, dmesg shows:

[846193.473396] ata1.00: status: { DRDY }
[846196.231058] ata1: soft resetting link
[846196.386714] ata1.01: NODEV after polling detection
[846196.391048] ata1.00: configured for MWDMA2
[846196.391053] ata1.00: retrying FLUSH 0xea Emask 0x4
[846196.391671] ata1: EH complete
[1019646.935659] UDP: bad checksum. From 122.224.153.109:46252 to
193.24.210.48:161 ulen 49
[1107679.421951] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
0x6 frozen
[1107679.423407] ata1.00: failed command: FLUSH CACHE EXT
[1107679.424871] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
 res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[1107679.427596] ata1.00: status: { DRDY }
[1107684.482035] ata1: link is slow to respond, please be patient (ready=0)
[1107689.480237] ata1: device not ready (errno=-16), forcing hardreset
[1107689.480267] ata1: soft resetting link
[1107689.637701] ata1.00: configured for MWDMA2
[1107689.637707] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107704.638255] ata1.00: qc timeout (cmd 0xea)
[1107704.638282] ata1.00: FLUSH failed Emask 0x4
[1107709.687013] ata1: link is slow to respond, please be patient (ready=0)
[1107710.095069] ata1: soft resetting link
[1107710.246403] ata1.01: NODEV after polling detection
[1107710.247225] ata1.00: configured for MWDMA2
[1107710.247229] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107710.248170] ata1: EH complete
[1199723.323256] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
0x6 frozen
[1199723.324769] ata1.00: failed command: FLUSH CACHE EXT
[1199723.326734] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
 res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)


Hostmachine is running Kernel 4.5.4


Hostmachine dmesg:


[1235641.055673] INFO: task qemu-kvm:18287 blocked for more than 120
seconds.
[1235641.056066]   Not tainted 4.5.4ceph-vps-default #1
[1235641.056315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1235641.056583] qemu-kvmD 8812f939bb58 0 18287  1
0x0080
[1235641.056587]  8812f939bb58 881034c02b80 881b7044ab80
8812f939c000
[1235641.056590]   7fff 881c7ffd7b70
818c1d90
[1235641.056592]  8812f939bb70 818c1525 88103fa16d00
8812f939bc18
[1235641.056594] Call Trace:
[1235641.056603]  [] ? bit_wait+0x50/0x50
[1235641.056605]  [] schedule+0x35/0x80
[1235641.056609]  [] schedule_timeout+0x231/0x2d0
[1235641.056613]  [] ? ktime_get+0x3c/0xb0
[1235641.056622]  [] ? bit_wait+0x50/0x50
[1235641.056624]  [] io_schedule_timeout+0xa6/0x110
[1235641.056626]  [] bit_wait_io+0x1b/0x60
[1235641.056627]  [] __wait_on_bit+0x60/0x90
[1235641.056632]  [] wait_on_page_bit+0xcb/0xf0
[1235641.056636]  [] ? autoremove_wake_function+0x40/0x40
[1235641.056638]  [] __filemap_fdatawait_range+0xff/0x180
[1235641.056641]  [] ?
__filemap_fdatawrite_range+0xd1/0x100
[1235641.056644]  [] filemap_fdatawait_range+0x14/0x30
[1235641.056646]  []
filemap_write_and_wait_range+0x3f/0x70
[1235641.056649]  [] ceph_fsync+0x69/0x5c0
[1235641.056656]  [] ? do_futex+0xfd/0x530
[1235641.056663]  [] vfs_fsync_range+0x3d/0xb0
[1235641.056668]  [] ?
syscall_trace_enter_phase1+0x139/0x150
[1235641.056670]  [] do_fsync+0x3d/0x70
[1235641.056673]  [] SyS_fdatasync+0x13/0x20
[1235641.056676]  [] entry_SYSCALL_64_fastpath+0x12/0x71


This sometimes happens, on a healthy cluster, running

ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

OSD Servers running Kernel 4.5.5

Sometimes it causes the VM to refuse IO and it has to be restarted; sometimes
not, and it continues.



Any input is appreciated. Thank you!


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 at the Amtsgericht (district court) of Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107



Re: [ceph-users] Ubuntu Xenial - Ceph repo uses weak digest algorithm (SHA1)

2017-01-05 Thread Ken Dreyer
Apologies for the thread necromancy :)

We've (finally) configured our signing system to use sha256 for GPG
digests, so this issue should no longer appear on Debian/Ubuntu.

- Ken

On Fri, May 27, 2016 at 6:20 AM, Saverio Proto  wrote:
> I started to use Xenial... does everyone have this error ? :
>
> W: http://ceph.com/debian-hammer/dists/xenial/InRelease: Signature by
> key 08B73419AC32B4E966C1A330E84AC2C0460F3994 uses weak digest
> algorithm (SHA1)
>
> Saverio


Re: [ceph-users] Recover VM Images from Dead Cluster

2017-01-05 Thread L. Bader
Actually these steps didn't work for me - I had an older version of 
Ceph, so I had to upgrade first.


However, the monmap could be restored, but the OSDs were not found anyway.

But I have now had success with the scripts I mentioned before, and I was able
to extract all VM images from the raw OSDs without using any Ceph tools.
So if you run into the same issue, you can use them to extract your images.


Thanks for your help!


On 25.12.2016 07:37, Shinobu Kinjo wrote:

On Sun, Dec 25, 2016 at 7:33 AM, Brad Hubbard  wrote:

On Sun, Dec 25, 2016 at 3:33 AM, w...@42on.com  wrote:



On 24 Dec 2016 at 17:20, L. Bader wrote the following:

Do you have any references on this?

I searched for something like this quite a lot and did not find anything...


No, I saw it somewhere on the ML I think, but I am not sure.

I just know it is in development or on a todo list somewhere.

Already done I believe.

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds

That approach should work (a condensed sketch of the documented procedure is
included below). But things which you *must* make sure of before taking
action are:

  1. You must stop all OSDs on the hosts you are going to SSH to before
running ceph-objectstore-tool against them.
  2. You should carry out that approach *much* more carefully.

So if you have any doubt, please let us know.
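
A condensed sketch of the "recovery using OSDs" procedure from the linked
documentation (paths and the keyring location are illustrative, and the exact
tooling, in particular "ceph-monstore-tool ... rebuild", is only available in
sufficiently recent releases, so follow the doc for your version):

    # with the OSDs stopped, collect the monitor data from every OSD host
    ms=/tmp/mon-store
    mkdir -p $ms
    for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path $osd --op update-mon-db --mon-store-path $ms
    done

    # then rebuild the monitor store and copy it onto a monitor node
    ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/admin.keyring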


HTH.


On 24.12.2016 14:55, w...@42on.com wrote:


On 24 Dec 2016 at 14:47, L. Bader wrote the following:

Hello,

I have a problem with our (dead) Ceph cluster: the configuration seems to be gone
(deleted / overwritten) and all monitors are gone as well. However, we do not have
(up-to-date) backups for all VMs (used with Proxmox) and we would like to recover
them from the "raw" OSDs only (we have all OSDs mounted on one storage server).
Restoring the cluster itself seems impossible.


Work is on its way, IIRC, to restore MONs from OSD data.

You might want to search for that; the tracker or GitHub might help.


To recover the VM images I tried to write a simple tool that (see the sketch
after the link below):
1) searches all OSDs for udata files
2) sorts them by image ID
3) sorts them by "position" / offset
4) assembles the 4MB blocks into a single file using dd

(See: https://gitlab.lbader.de/kryptur/ceph-recovery/tree/master )
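
A rough sketch of that reassembly step (assuming FileStore OSDs whose object
files embed the RBD image prefix and the 16-hex-digit object number in their
names; the prefix, mount points and name parsing are illustrative and the
exact on-disk escaping varies, so treat this only as an outline of the idea):

    prefix=1016238e1f29          # rbd_data.<prefix> of the image to rebuild
    out=recovered.img
    find /mnt/osd-* -type f -name "*udata.${prefix}.*" | while read -r obj; do
        # the 16-digit hex object number encodes the 4MB offset in the image
        objno=$(basename "$obj" | sed -n "s/.*udata\.${prefix}\.\([0-9a-f]\{16\}\).*/\1/p")
        [ -n "$objno" ] || continue
        dd if="$obj" of="$out" bs=4M seek=$((16#$objno)) conv=notrunc
    done

Objects that are missing simply leave holes (which read back as zeros), which
matches the "null bytes for each missing part" approach described below.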

However, for many (nearly all) images there are missing blocks (empty parts, I
guess). So I created a 4MB block of null bytes for each missing part.

The problem is that the created Image is not usable. fdisk detects partitions 
correctly, but we cannot access the data in any way.

Is there another way to recover the data without having any (working) ceph 
tools?

Greetings and Merry Christmas :)

Lennart




--
Cheers,
Brad




Re: [ceph-users] Cephalocon Sponsorships Open

2017-01-05 Thread McFarland, Bruce
Patrick,
Do you have any rough idea of when the deadline for submitting
presentation proposals might be? Not to rush you; we're interested, but we
know it might take some time to get internal approval to present at an
outside conference.
Thanks,
Bruce

On 1/4/17, 8:49 AM, "ceph-users on behalf of Patrick McGarry"
 wrote:

>Hey Wes,
>
>We'd love to have you guys. I'll send out another note once we open
>the CFP, though; this one is just for those who wish to sponsor to help
>make it happen. Thanks for your interest, and keep an eye out for the
>CFP! :)
>
>
>On Thu, Dec 22, 2016 at 2:16 PM, Wes Dillingham
> wrote:
>> I / my group / our organization would be interested in discussing our
>> deployment of Ceph and how we are using it, deploying it, future plans
>>etc.
>> This sounds like an exciting event. We look forward to hearing more
>>details.
>>
>> On Thu, Dec 22, 2016 at 1:44 PM, Patrick McGarry 
>> wrote:
>>>
>>> Hey cephers,
>>>
>>> Just letting you know that we're opening the flood gates for
>>> sponsorship opportunities at Cephalocon next year (23-25 Aug 2017,
>>> Boston, MA). If you would be interested in sponsoring/exhibiting at
>>> our inaugural Ceph conference, please drop me a line. Thanks!
>>>
>>>
>>> --
>>>
>>> Best Regards,
>>>
>>> Patrick McGarry
>>> Director Ceph Community || Red Hat
>>> http://ceph.com  ||  http://community.redhat.com
>>> @scuttlemonkey || @ceph
>>
>>
>>
>>
>> --
>> Respectfully,
>>
>> Wes Dillingham
>> wes_dilling...@harvard.edu
>> Research Computing | Infrastructure Engineer
>> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>>
>
>
>
>-- 
>
>Best Regards,
>
>Patrick McGarry
>Director Ceph Community || Red Hat
>http://ceph.com  ||  http://community.redhat.com
>@scuttlemonkey || @ceph


Re: [ceph-users] RBD mirroring

2017-01-05 Thread Jason Dillaman
On Thu, Jan 5, 2017 at 7:24 AM, Klemen Pogacnik  wrote:
> I'm playing with rbd mirroring with OpenStack. The final idea is to use it
> for disaster recovery of a DB server running on an OpenStack cluster, but I
> would like to test this functionality first.
> I've prepared this configuration:
> - 2 OpenStack clusters (devstacks)
> - 2 Ceph clusters (one-node clusters)
> The remote Ceph is used as the backend for the Cinder service. Each devstack
> has its own Ceph cluster. Mirroring was enabled for the volumes pool, and the
> rbd-mirror daemon was started.
> When I create a new Cinder volume on devstack1, the same RBD image appears
> on both Ceph clusters, so it seems mirroring is working.
> Now I would like to see this storage as a Cinder volume on devstack2 too. Is
> it somehow possible to do that?

This level of HA/DR is not currently built in to OpenStack (and it is
outside the scope of Ceph). There are several strategies you could use
to try to replicate the devstack1 database to devstack2. Here is a
presentation from OpenStack Summit Austin [1] on this subject.

> The next question is how to do a switchover. On Ceph it can easily be done
> with the demote and promote commands, but then the volumes are still not seen
> on devstack2, so I can't use them.
> In OpenStack there is the cinder failover-host command, which is, as I
> understand it, only useful for a configuration with one OpenStack and two Ceph
> clusters. Any idea how to do a switchover with my configuration?
> Thanks a lot for help!

Correct -- Cinder's built-in volume replication feature is just a set
of hooks available for backends that already support
replication/mirroring. The hooks for Ceph RBD are scheduled to be
included in the next release of OpenStack, but as you have discovered,
it really only protects against a storage failure (where you can
switch from Ceph cluster A to Ceph cluster B) and does not help with
losing your OpenStack data center.
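
For reference, the Ceph-side switchover mentioned above looks roughly like
this (a sketch only; pool/image names are illustrative, and this flips the
RBD primary without touching anything in Cinder):

    # on the current primary site, demote the image...
    rbd mirror image demote volumes/volume-1234
    # ...then on the peer site, promote its mirrored copy
    rbd mirror image promote volumes/volume-1234
    # if the old primary site is lost, force-promote on the surviving site
    rbd mirror image promote --force volumes/volume-1234

There are also pool-level variants (rbd mirror pool promote/demote) that act
on all mirrored images in a pool at once.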

> Kemo
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

[1] 
https://www.openstack.org/videos/video/protecting-the-galaxy-multi-region-disaster-recovery-with-openstack-and-ceph

-- 
Jason


Re: [ceph-users] RGW pool usage is higher that total bucket size

2017-01-05 Thread Wido den Hollander

> On 5 January 2017 at 10:08, Luis Periquito wrote:
> 
> 
> Hi,
> 
> I have a cluster with RGW in which one bucket is really big, so every
> so often we delete stuff from it.
> 
> That bucket is now taking 3.3T after we deleted just over 1T from it.
> That was done last week.
> 
> The pool (.rgw.buckets) is using 5.1T, and before the deletion was
> taking almost 6T.
> 
> How can I delete old data from the pool? Currently the pool is using
> 1.5T more than the sum of the buckets...
> 
> radosgw-admin gc list --include-all returns a few objects that were
> recently deleted (whose timestamps are a couple of hours in the
> future).
> 
> This cluster was originally Hammer, has since been upgraded to
> Infernalis and now running Jewel (10.2.3).
> 
> Any ideas on recovering the lost space?
> 

Have you tried running an orphan scan with radosgw-admin?
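
A sketch of what that looks like with the Jewel-era tooling (pool and job
names are illustrative; the search only reports orphans, it does not delete
them):

    radosgw-admin orphans find --pool=.rgw.buckets --job-id=orphans1
    radosgw-admin orphans list-jobs
    # after reviewing the reported objects, clean up the search data
    radosgw-admin orphans finish --job-id=orphans1

Removing the orphaned RADOS objects themselves is then a separate, manual
step.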

You might be hitting this:

- http://tracker.ceph.com/issues/18331
- http://tracker.ceph.com/issues/18258

RGW seems to leak/orphan objects in RADOS.

Wido

> thanks,


[ceph-users] RBD mirroring

2017-01-05 Thread Klemen Pogacnik
I'm playing with rbd mirroring with OpenStack. The final idea is to use it
for disaster recovery of a DB server running on an OpenStack cluster, but I
would like to test this functionality first.
I've prepared this configuration:
- 2 OpenStack clusters (devstacks)
- 2 Ceph clusters (one-node clusters)
The remote Ceph is used as the backend for the Cinder service. Each devstack
has its own Ceph cluster. Mirroring was enabled for the volumes pool, and the
rbd-mirror daemon was started.
When I create a new Cinder volume on devstack1, the same RBD image appears
on both Ceph clusters, so it seems mirroring is working.
Now I would like to see this storage as a Cinder volume on devstack2 too.
Is it somehow possible to do that?
The next question is how to do a switchover. On Ceph it can easily be done
with the demote and promote commands, but then the volumes are still not seen
on devstack2, so I can't use them.
In OpenStack there is the cinder failover-host command, which is, as I
understand it, only useful for a configuration with one OpenStack and two Ceph
clusters. Any idea how to do a switchover with my configuration?
Thanks a lot for your help!
Kemo


[ceph-users] RGW pool usage is higher that total bucket size

2017-01-05 Thread Luis Periquito
Hi,

I have a cluster with RGW in which one bucket is really big, so every
so often we delete stuff from it.

That bucket is now taking 3.3T after we deleted just over 1T from it.
That was done last week.

The pool (.rgw.buckets) is using 5.1T, and before the deletion was
taking almost 6T.

How can I delete old data from the pool? Currently the pool is using
1.5T more than the sum of the buckets...

radosgw-admin gc list --include-all returns a few objects that were
recently deleted (whose timestamps are a couple of hours in the
future).

This cluster was originally Hammer, has since been upgraded to
Infernalis and now running Jewel (10.2.3).

Any ideas on recovering the lost space?

thanks,


Re: [ceph-users] High OSD apply latency right after new year (the leap second?)

2017-01-05 Thread Craig Chi
Hi,

I'm glad to know it didn't happen only to me.
Though it is harmless, it seems like some kind of bug...
Are there any Ceph developers who know exactly how the
"ceph osd perf" command is implemented?
Is the leap second really responsible for this behavior?
Thanks.

Sincerely,
Craig Chi

On 2017-01-04 19:55, Alexandre DERUMIER wrote:
> Yes, same here on 3 production clusters. No impact, but a nice happy new
> year alert ;)
> It seems that Google provides NTP servers to avoid the brutal 1-second leap:
> https://developers.google.com/time/smear
>
> ----- Original Message -----
> From: "Craig Chi"
> To: "ceph-users"
> Sent: Wednesday, 4 January 2017 11:26:21
> Subject: [ceph-users] High OSD apply latency right after new year (the leap second?)
>
> Hi List,
>
> Three of our Ceph OSDs got unreasonably high latency right after the first
> second of the new year (2017/01/01 00:00:00 UTC; I have attached the metrics
> and I am in the UTC+8 timezone). There is exactly one PG (size=3) that
> contains just these 3 OSDs. The OSD apply latency is usually up to 25 minutes,
> and I can also see this large number randomly when I execute the
> "ceph osd perf" command. But the 3 OSDs do not show strange behavior and are
> performing fine so far.
>
> I have no idea how "ceph osd perf" is implemented, but does it have any
> relation to the leap second this year? Since the cluster is not in production,
> and the developers were all celebrating the new year at that time, I cannot
> think of other possibilities.
>
> Do your clusters also get this interestingly unexpected new year's gift?
>
> Sincerely,
> Craig Chi
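
For reference, pointing ntpd (or chrony, with equivalent syntax) at the
leap-smearing servers mentioned above looks roughly like this; note that
Google advises against mixing smeared and non-smeared time sources:

    # /etc/ntp.conf
    server time1.google.com iburst
    server time2.google.com iburst
    server time3.google.com iburst
    server time4.google.com iburst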

