Re: [ceph-users] Fwd: down+peering PGs, can I move PGs from one OSD to another

2018-08-03 Thread Sean Patronis
Forgive the wall of text; I shortened it a little. Here is the OSD log from when I
attempt to start the OSD:

2018-08-04 03:53:28.917418 7f3102aa87c0  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-21) detect_feature: extsize is
disabled by conf
2018-08-04 03:53:28.977564 7f3102aa87c0  0
filestore(/var/lib/ceph/osd/ceph-21) mount: WRITEAHEAD journal mode
explicitly enabled in conf
2018-08-04 03:53:29.001967 7f3102aa87c0 -1 journal FileJournal::_open:
disabling aio for non-block journal.  Use journal_force_aio to force use of
aio anyway
2018-08-04 03:53:29.001981 7f3102aa87c0  1 journal _open
/var/lib/ceph/osd/ceph-21/journal fd 21: 2147483648 bytes, block size 4096
bytes, directio = 1, aio = 0
2018-08-04 03:53:29.002030 7f3102aa87c0  1 journal _open
/var/lib/ceph/osd/ceph-21/journal fd 21: 2147483648 bytes, block size 4096
bytes, directio = 1, aio = 0
2018-08-04 03:53:29.255501 7f3102aa87c0  0 
cls/hello/cls_hello.cc:271: loading cls_hello
2018-08-04 03:53:29.335038 7f3102aa87c0  0 osd.21 19579 crush map has
features 1107558400, adjusting msgr requires for clients
2018-08-04 03:53:29.335058 7f3102aa87c0  0 osd.21 19579 crush map has
features 1107558400, adjusting msgr requires for mons
2018-08-04 03:53:29.335062 7f3102aa87c0  0 osd.21 19579 crush map has
features 1107558400, adjusting msgr requires for osds
2018-08-04 03:53:29.335077 7f3102aa87c0  0 osd.21 19579 load_pgs
2018-08-04 03:54:00.275885 7f3102aa87c0 -1 osd/PG.cc: In function 'static
epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
ceph::bufferlist*)' thread 7f3102aa87c0 time 2018-08-04 03:54:00.274454
osd/PG.cc: 2577: FAILED assert(values.size() == 1)

 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
ceph::buffer::list*)+0x578) [0x741a18]
 2: (OSD::load_pgs()+0x1993) [0x655d13]
 3: (OSD::init()+0x1ba1) [0x65fff1]
 4: (main()+0x1ea7) [0x602fd7]
 5: (__libc_start_main()+0xed) [0x7f31008a276d]
 6: /usr/bin/ceph-osd() [0x607119]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- begin dump of recent events ---
 -3406> 2018-08-04 03:53:24.680985 7f3102aa87c0  5 asok(0x3c40230)
register_command perfcounters_dump hook 0x3c1c010
 -3405> 2018-08-04 03:53:24.681040 7f3102aa87c0  5 asok(0x3c40230)
register_command 1 hook 0x3c1c010
 -3404> 2018-08-04 03:53:24.681046 7f3102aa87c0  5 asok(0x3c40230)
register_command perf dump hook 0x3c1c010
 -3403> 2018-08-04 03:53:24.681052 7f3102aa87c0  5 asok(0x3c40230)
register_command perfcounters_schema hook 0x3c1c010
 -3402> 2018-08-04 03:53:24.681055 7f3102aa87c0  5 asok(0x3c40230)
register_command 2 hook 0x3c1c010
 -3401> 2018-08-04 03:53:24.681058 7f3102aa87c0  5 asok(0x3c40230)
register_command perf schema hook 0x3c1c010
 -3400> 2018-08-04 03:53:24.681061 7f3102aa87c0  5 asok(0x3c40230)
register_command config show hook 0x3c1c010
 -3399> 2018-08-04 03:53:24.681064 7f3102aa87c0  5 asok(0x3c40230)
register_command config set hook 0x3c1c010
 -3398> 2018-08-04 03:53:24.681095 7f3102aa87c0  5 asok(0x3c40230)
register_command config get hook 0x3c1c010
 -3397> 2018-08-04 03:53:24.681101 7f3102aa87c0  5 asok(0x3c40230)
register_command log flush hook 0x3c1c010
 -3396> 2018-08-04 03:53:24.681108 7f3102aa87c0  5 asok(0x3c40230)
register_command log dump hook 0x3c1c010
 -3395> 2018-08-04 03:53:24.681116 7f3102aa87c0  5 asok(0x3c40230)
register_command log reopen hook 0x3c1c010
 -3394> 2018-08-04 03:53:24.689976 7f3102aa87c0  0 ceph version 0.80.4
(7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f), process ceph-osd, pid 51827
 -3393> 2018-08-04 03:53:24.727583 7f3102aa87c0  1 -- 192.168.0.4:0/0
learned my addr 192.168.0.4:0/0
 -3392> 2018-08-04 03:53:24.727613 7f3102aa87c0  1 accepter.accepter.bind
my_inst.addr is 192.168.0.4:6801/51827 need_addr=0
 -3391> 2018-08-04 03:53:24.727638 7f3102aa87c0  1 -- 192.168.1.3:0/0
learned my addr 192.168.1.3:0/0
 -3390> 2018-08-04 03:53:24.727652 7f3102aa87c0  1 accepter.accepter.bind
my_inst.addr is 192.168.1.3:6800/51827 need_addr=0
 -3389> 2018-08-04 03:53:24.727676 7f3102aa87c0  1 -- 192.168.1.3:0/0
learned my addr 192.168.1.3:0/0
 -3388> 2018-08-04 03:53:24.727687 7f3102aa87c0  1 accepter.accepter.bind
my_inst.addr is 192.168.1.3:6801/51827 need_addr=0
 -3387> 2018-08-04 03:53:24.727722 7f3102aa87c0  1 -- 192.168.0.4:0/0
learned my addr 192.168.0.4:0/0
 -3386> 2018-08-04 03:53:24.727732 7f3102aa87c0  1 accepter.accepter.bind
my_inst.addr is 192.168.0.4:6810/51827 need_addr=0
 -3385> 2018-08-04 03:53:24.727767 7f3102aa87c0  1 -- 192.168.0.4:0/0
learned my addr 192.168.0.4:0/0
 -3384> 2018-08-04 03:53:24.72 7f3102aa87c0  1 accepter.accepter.bind
my_inst.addr is 192.168.0.4:6811/51827 need_addr=0
 -3383> 2018-08-04 03:53:24.728871 7f3102aa87c0  1 finished
global_init_daemonize
 -3382> 2018-08-04 03:53:24.761702 7f3102aa87c0  5 asok(0x3c40230) init
/var/run/ceph/ceph-osd.21.asok
 -3381> 2018-08-04 03:53:24.761737 7f3102aa87c0  5 asok(0x3c40230)
bind_and_listen 

[ceph-users] Inconsistent PGs every few days

2018-08-03 Thread Dimitri Roschkowski

Hi,

I run a cluster with 7 OSDs. The cluster does not have much traffic on it, but 
every few days I get a HEALTH_ERR because of inconsistent PGs:


root@Sam ~ # ceph status
  cluster:
id: c4bfc288-8ba8-4c3a-b3a6-ed95503f50b7
health: HEALTH_ERR
3 scrub errors
Possible data damage: 3 pgs inconsistent

  services:
mon: 1 daemons, quorum mon1
mgr: ceph-osd1(active)
mds: FS-1/1/1 up  {0=ceph-osd1=up:active}
osd: 11 osds: 8 up, 7 in
rgw: 1 daemon active

  data:
pools:   6 pools, 168 pgs
objects: 901.8 k objects, 2.6 TiB
usage: 7.9 TiB used, 7.4 TiB / 15 TiB avail
pgs: 165 active+clean
 3   active+clean+inconsistent

  io:
client:   641 KiB/s wr, 0 op/s rd, 3 op/s wr


root@Sam ~ # ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
pg 5.1d is active+clean+inconsistent, acting [6,8,3]
pg 5.20 is active+clean+inconsistent, acting [3,9,0]
pg 5.4a is active+clean+inconsistent, acting [6,3,7]

What's the reason for this problem? How can I analyse it?
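
For reference, I assume the usual starting point is listing the inconsistent 
objects for one of these PGs and, if the cause is clear, repairing it, e.g.:

rados list-inconsistent-obj 5.1d --format=json-pretty
ceph pg repair 5.1d

but I'd like to understand what is causing the scrub errors before blindly 
repairing.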

Cheers, Dimitri


Re: [ceph-users] Fwd: down+peering PGs, can I move PGs from one OSD to another

2018-08-03 Thread Sean Redmond
Hi,

You can export and import PGs using ceph_objectstore_tool, but if the OSD
won't start you may have trouble exporting a PG.
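
As a rough sketch (the PG ID and paths are only examples, the OSD must not be
running while you do this, and depending on your Ceph version the binary may be
named ceph-objectstore-tool):

ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-21 \
    --journal-path /var/lib/ceph/osd/ceph-21/journal \
    --pgid 1.2f --op export --file /tmp/pg.1.2f.export

ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --pgid 1.2f --op import --file /tmp/pg.1.2f.export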

It may be useful to share the errors you get when trying to start the OSD.

Thanks

On Fri, Aug 3, 2018 at 10:13 PM, Sean Patronis  wrote:

>
>
> Hi all.
>
> We have an issue with some down+peering PGs (I think), when I try to mount or 
> access data the requests are blocked:
>
> 114891/7509353 objects degraded (1.530%)
>  887 stale+active+clean
>1 peering
>   54 active+recovery_wait
>19609 active+clean
>   91 active+remapped+wait_backfill
>   10 active+recovering
>1 active+clean+scrubbing+deep
>9 down+peering
>   10 active+remapped+backfilling
> recovery io 67324 kB/s, 10 objects/s
>
> when I query one of these down+peering PGs, I can see the following:
>
>  "peering_blocked_by": [
> { "osd": 7,
>   "current_lost_at": 0,
>   "comment": "starting or marking this osd lost may let us 
> proceed"},
> { "osd": 21,
>   "current_lost_at": 0,
>   "comment": "starting or marking this osd lost may let us 
> proceed"}]},
> { "name": "Started",
>   "enter_time": "2018-08-01 07:06:16.806339"}],
>
>
>
> Both of these OSDs (7 and 21) will not come back up and in with ceph due
> to some errors, but I can mount the disks and read data off of them.  Can I
> manually move/copy these PGs off of these down and out OSDs and put them on
> a good OSD?
>
> This is an older ceph cluster running firefly.
>
> Thanks.
>
>
>
>


[ceph-users] Fwd: down+peering PGs, can I move PGs from one OSD to another

2018-08-03 Thread Sean Patronis
Hi all.

We have an issue with some down+peering PGs (I think); when I try to
mount or access data, the requests are blocked:

114891/7509353 objects degraded (1.530%)
 887 stale+active+clean
   1 peering
  54 active+recovery_wait
   19609 active+clean
  91 active+remapped+wait_backfill
  10 active+recovering
   1 active+clean+scrubbing+deep
   9 down+peering
  10 active+remapped+backfilling
recovery io 67324 kB/s, 10 objects/s

when I query one of these down+peering PGs, I can see the following:

 "peering_blocked_by": [
{ "osd": 7,
  "current_lost_at": 0,
  "comment": "starting or marking this osd lost may
let us proceed"},
{ "osd": 21,
  "current_lost_at": 0,
  "comment": "starting or marking this osd lost may
let us proceed"}]},
{ "name": "Started",
  "enter_time": "2018-08-01 07:06:16.806339"}],



Both of these OSDs (7 and 21) will not come back up and in with ceph due to
some errors, but I can mount the disks and read data off of them.  Can I
manually move/copy these PGs off of these down and out OSDs and put them on
a good OSD?

This is an older ceph cluster running firefly.

Thanks.



Re: [ceph-users] FileStore SSD (journal) vs BlueStore SSD (DB/Wal)

2018-08-03 Thread Xavier Trilla
Hi Sam,

Not having done any benchmarks myself (we only use SSDs or NVMes), my 
understanding is that on Luminous (I would not recommend upgrading production to Mimic 
yet, but I'm quite conservative) BlueStore is going to be slower for writes 
than FileStore with SSD journals.

You could try dm-cache, bcache, etc. and add some SSD caching to each HDD 
(keep in mind this can affect the write endurance of the SSDs).

dm-cache plus BlueStore seems to be quite an interesting option IMO, as you'll get 
faster reads and writes, and you'll avoid the double-write penalty of FileStore.
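
If you do end up comparing the two setups, even a quick synthetic run on each 
would tell you a lot. A minimal sketch (pool and image names are only examples, 
and the fio rbd engine needs a test image created first):

rados bench -p testpool 60 write -b 4096 -t 16
fio --name=bench --ioengine=rbd --clientname=admin --pool=testpool \
    --rbdname=testimage --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based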

Cheers!

Best regards,
Xavier Trilla P.
Clouding.io

A cloud server with SSDs, redundant
and available in less than 30 seconds?

Try it now at Clouding.io!

From: ceph-users  On behalf of Sam Huracan
Sent: Friday, 3 August 2018 16:36
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] FileStore SSD (journal) vs BlueStore SSD (DB/Wal)

Hi,

Can anyone help us answer these questions?



2018-08-03 8:36 GMT+07:00 Sam Huracan <nowitzki.sa...@gmail.com>:
Hi Cephers,

We intend to upgrade our Cluster from Jewel to Luminous (or Mimic?)

Our model is currently using OSD File Store with SSD Journal (1 SSD for 7 SATA 
7.2K)

My questions are:


1. Should we change to BlueStore with the DB/WAL on SSD and data on HDD? (We 
want to keep the model of using an SSD journal for caching.) Is there any improvement 
in overall performance? We think that with the SSD-cache model, FileStore will write 
faster because data is written to the SSD before flushing to SAS, whereas with 
BlueStore, data will be written directly to SAS.


2. Have you guys ever benchmarked and compared two such clusters: FileStore with 
SSD journal vs. BlueStore with SSD DB/WAL?


Thanks in advance.





Re: [ceph-users] RGW problems after upgrade to Luminous

2018-08-03 Thread David Turner
I came across you mentioning bucket check --fix before, but I totally
forgot that I should be passing --bucket=mybucket with the command for it to
actually do anything.  I'm running this now and it seems to actually be
doing something.  My guess is that it was stuck in that state, and now that
I can clean up the bucket I should be able to try resharding it again.
Thank you so much.
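
For the archive, the exact invocation (bucket name changed) is simply:

radosgw-admin bucket check --fix --bucket=mybucket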

On Fri, Aug 3, 2018 at 12:50 PM Yehuda Sadeh-Weinraub 
wrote:

> Oh, also -- one thing that might work is running bucket check --fix on
> the bucket. That should overwrite the reshard status field in the
> bucket index.
>
> Let me know if it happens to fix the issue for you.
>
> Yehuda.
>
> On Fri, Aug 3, 2018 at 9:46 AM, Yehuda Sadeh-Weinraub 
> wrote:
> > Is it actually resharding, or is it just stuck in that state?
> >
> > On Fri, Aug 3, 2018 at 7:55 AM, David Turner 
> wrote:
> >> I am currently unable to write any data to this bucket in this current
> >> state.  Does anyone have any ideas for reverting to the original index
> >> shards and cancel the reshard processes happening to the bucket?
> >>
> >> On Thu, Aug 2, 2018 at 12:32 PM David Turner 
> wrote:
> >>>
> >>> I upgraded my last cluster to Luminous last night.  It had some very
> large
> >>> bucket indexes on Jewel which caused a couple problems with the
> upgrade, but
> >>> finally everything finished and we made it to the other side, but now
> I'm
> >>> having problems with [1] these errors populating a lot of our RGW logs
> and
> >>> clients seeing the time skew error responses.  The time stamps between
> the
> >>> client nodes, rgw nodes, and the rest of the ceph cluster match
> perfectly
> >>> and actually build off of the same ntp server.
> >>>
> >>> I tried disabling dynamic resharding for the RGW daemons by placing
> this
> >>> in the ceph.conf for the affected daemons `rgw_dynamic_resharding =
> false`
> >>> and restarting them as well as issuing a reshard cancel for the
> bucket, but
> >>> nothing seems to actually stop the reshard from processing.  Here's the
> >>> output of a few commands.  [2] reshard list [3] reshard status
> >>>
> >>> Are there any things we can do to actually disable bucket resharding or
> >>> let it finish?  I'm stuck on ideas.  I've tried quite a few things I've
> >>> found around except for manually resharding which is a last resort
> here.
> >>> This bucket won't exist in a couple months and the performance is good
> >>> enough without resharding, but I don't know how to get it to stop.
> Thanks.
> >>>
> >>>
> >>> [1] 2018-08-02 16:22:16.047387 7fbe82e61700  0 NOTICE: resharding
> >>> operation on bucket index detected, blocking
> >>> 2018-08-02 16:22:16.206950 7fbe8de77700  0 block_while_resharding
> ERROR:
> >>> bucket is still resharding, please retry
> >>> 2018-08-02 16:22:12.253734 7fbe4f5fa700  0 NOTICE: request time skew
> too
> >>> big now=2018-08-02 16:22:12.00 req_time=2018-08-02 16:06:03.00
> >>>
> >>> [2] $ radosgw-admin reshard list
> >>> [2018-08-02 16:13:19.082172 7f3ca4163c80 -1 ERROR: failed to list
> reshard
> >>> log entries, oid=reshard.10
> >>> 2018-08-02 16:13:19.082757 7f3ca4163c80 -1 ERROR: failed to list
> reshard
> >>> log entries, oid=reshard.11
> >>> 2018-08-02 16:13:19.083941 7f3ca4163c80 -1 ERROR: failed to list
> reshard
> >>> log entries, oid=reshard.12
> >>> 2018-08-02 16:13:19.085170 7f3ca4163c80 -1 ERROR: failed to list
> reshard
> >>> log entries, oid=reshard.13
> >>> 2018-08-02 16:13:19.085898 7f3ca4163c80 -1 ERROR: failed to list
> reshard
> >>> log entries, oid=reshard.14
> >>> ]
> >>> 2018-08-02 16:13:19.086476 7f3ca4163c80 -1 ERROR: failed to list
> reshard
> >>> log entries, oid=reshard.15
> >>>
> >>> [3] $ radosgw-admin reshard status --bucket my-bucket
> >>> [
> >>> {
> >>> "reshard_status": 1,
> >>> "new_bucket_instance_id":
> >>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> >>> "num_shards": 32
> >>> },
> >>> {
> >>> "reshard_status": 1,
> >>> "new_bucket_instance_id":
> >>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> >>> "num_shards": 32
> >>> },
> >>> {
> >>> "reshard_status": 1,
> >>> "new_bucket_instance_id":
> >>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> >>> "num_shards": 32
> >>> },
> >>> {
> >>> "reshard_status": 1,
> >>> "new_bucket_instance_id":
> >>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> >>> "num_shards": 32
> >>> },
> >>> {
> >>> "reshard_status": 1,
> >>> "new_bucket_instance_id":
> >>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> >>> "num_shards": 32
> >>> },
> >>> {
> >>> "reshard_status": 1,
> >>> "new_bucket_instance_id":
> >>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> >>> "num_shards": 32
> >>> },
> >>> {
> >>> "reshard_status": 1,
> >>> 

Re: [ceph-users] RGW problems after upgrade to Luminous

2018-08-03 Thread Yehuda Sadeh-Weinraub
Oh, also -- one thing that might work is running bucket check --fix on
the bucket. That should overwrite the reshard status field in the
bucket index.

Let me know if it happens to fix the issue for you.

Yehuda.

On Fri, Aug 3, 2018 at 9:46 AM, Yehuda Sadeh-Weinraub  wrote:
> Is it actually resharding, or is it just stuck in that state?
>
> On Fri, Aug 3, 2018 at 7:55 AM, David Turner  wrote:
>> I am currently unable to write any data to this bucket in this current
>> state.  Does anyone have any ideas for reverting to the original index
>> shards and cancel the reshard processes happening to the bucket?
>>
>> On Thu, Aug 2, 2018 at 12:32 PM David Turner  wrote:
>>>
>>> I upgraded my last cluster to Luminous last night.  It had some very large
>>> bucket indexes on Jewel which caused a couple problems with the upgrade, but
>>> finally everything finished and we made it to the other side, but now I'm
>>> having problems with [1] these errors populating a lot of our RGW logs and
>>> clients seeing the time skew error responses.  The time stamps between the
>>> client nodes, rgw nodes, and the rest of the ceph cluster match perfectly
>>> and actually build off of the same ntp server.
>>>
>>> I tried disabling dynamic resharding for the RGW daemons by placing this
>>> in the ceph.conf for the affected daemons `rgw_dynamic_resharding = false`
>>> and restarting them as well as issuing a reshard cancel for the bucket, but
>>> nothing seems to actually stop the reshard from processing.  Here's the
>>> output of a few commands.  [2] reshard list [3] reshard status
>>>
>>> Are there any things we can do to actually disable bucket resharding or
>>> let it finish?  I'm stuck on ideas.  I've tried quite a few things I've
>>> found around except for manually resharding which is a last resort here.
>>> This bucket won't exist in a couple months and the performance is good
>>> enough without resharding, but I don't know how to get it to stop.  Thanks.
>>>
>>>
>>> [1] 2018-08-02 16:22:16.047387 7fbe82e61700  0 NOTICE: resharding
>>> operation on bucket index detected, blocking
>>> 2018-08-02 16:22:16.206950 7fbe8de77700  0 block_while_resharding ERROR:
>>> bucket is still resharding, please retry
>>> 2018-08-02 16:22:12.253734 7fbe4f5fa700  0 NOTICE: request time skew too
>>> big now=2018-08-02 16:22:12.00 req_time=2018-08-02 16:06:03.00
>>>
>>> [2] $ radosgw-admin reshard list
>>> [2018-08-02 16:13:19.082172 7f3ca4163c80 -1 ERROR: failed to list reshard
>>> log entries, oid=reshard.10
>>> 2018-08-02 16:13:19.082757 7f3ca4163c80 -1 ERROR: failed to list reshard
>>> log entries, oid=reshard.11
>>> 2018-08-02 16:13:19.083941 7f3ca4163c80 -1 ERROR: failed to list reshard
>>> log entries, oid=reshard.12
>>> 2018-08-02 16:13:19.085170 7f3ca4163c80 -1 ERROR: failed to list reshard
>>> log entries, oid=reshard.13
>>> 2018-08-02 16:13:19.085898 7f3ca4163c80 -1 ERROR: failed to list reshard
>>> log entries, oid=reshard.14
>>> ]
>>> 2018-08-02 16:13:19.086476 7f3ca4163c80 -1 ERROR: failed to list reshard
>>> log entries, oid=reshard.15
>>>
>>> [3] $ radosgw-admin reshard status --bucket my-bucket
>>> [
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> "reshard_status": 1,
>>> "new_bucket_instance_id":
>>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>>> "num_shards": 32
>>> },
>>> {
>>> 

Re: [ceph-users] RGW problems after upgrade to Luminous

2018-08-03 Thread Yehuda Sadeh-Weinraub
Is it actually resharding, or is it just stuck in that state?

On Fri, Aug 3, 2018 at 7:55 AM, David Turner  wrote:
> I am currently unable to write any data to this bucket in this current
> state.  Does anyone have any ideas for reverting to the original index
> shards and cancel the reshard processes happening to the bucket?
>
> On Thu, Aug 2, 2018 at 12:32 PM David Turner  wrote:
>>
>> I upgraded my last cluster to Luminous last night.  It had some very large
>> bucket indexes on Jewel which caused a couple problems with the upgrade, but
>> finally everything finished and we made it to the other side, but now I'm
>> having problems with [1] these errors populating a lot of our RGW logs and
>> clients seeing the time skew error responses.  The time stamps between the
>> client nodes, rgw nodes, and the rest of the ceph cluster match perfectly
>> and actually build off of the same ntp server.
>>
>> I tried disabling dynamic resharding for the RGW daemons by placing this
>> in the ceph.conf for the affected daemons `rgw_dynamic_resharding = false`
>> and restarting them as well as issuing a reshard cancel for the bucket, but
>> nothing seems to actually stop the reshard from processing.  Here's the
>> output of a few commands.  [2] reshard list [3] reshard status
>>
>> Are there any things we can do to actually disable bucket resharding or
>> let it finish?  I'm stuck on ideas.  I've tried quite a few things I've
>> found around except for manually resharding which is a last resort here.
>> This bucket won't exist in a couple months and the performance is good
>> enough without resharding, but I don't know how to get it to stop.  Thanks.
>>
>>
>> [1] 2018-08-02 16:22:16.047387 7fbe82e61700  0 NOTICE: resharding
>> operation on bucket index detected, blocking
>> 2018-08-02 16:22:16.206950 7fbe8de77700  0 block_while_resharding ERROR:
>> bucket is still resharding, please retry
>> 2018-08-02 16:22:12.253734 7fbe4f5fa700  0 NOTICE: request time skew too
>> big now=2018-08-02 16:22:12.00 req_time=2018-08-02 16:06:03.00
>>
>> [2] $ radosgw-admin reshard list
>> [2018-08-02 16:13:19.082172 7f3ca4163c80 -1 ERROR: failed to list reshard
>> log entries, oid=reshard.10
>> 2018-08-02 16:13:19.082757 7f3ca4163c80 -1 ERROR: failed to list reshard
>> log entries, oid=reshard.11
>> 2018-08-02 16:13:19.083941 7f3ca4163c80 -1 ERROR: failed to list reshard
>> log entries, oid=reshard.12
>> 2018-08-02 16:13:19.085170 7f3ca4163c80 -1 ERROR: failed to list reshard
>> log entries, oid=reshard.13
>> 2018-08-02 16:13:19.085898 7f3ca4163c80 -1 ERROR: failed to list reshard
>> log entries, oid=reshard.14
>> ]
>> 2018-08-02 16:13:19.086476 7f3ca4163c80 -1 ERROR: failed to list reshard
>> log entries, oid=reshard.15
>>
>> [3] $ radosgw-admin reshard status --bucket my-bucket
>> [
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
>> "num_shards": 32
>> },
>> {
>> "reshard_status": 1,
>> "new_bucket_instance_id":
>> 

Re: [ceph-users] Ceph Balancer per Pool/Crush Unit

2018-08-03 Thread Reed Dier
I suppose I may have found the solution I was unaware existed.

> balancer optimize <plan> {<pools> [<pools>...]} :  Run optimizer to create a 
> new plan

So apparently you can create a plan specific to one or more pools.
Just to double-check this, I created two plans: plan1 with the hdd pool (and 
not the ssd pool), and plan2 with no arguments.

I then ran ceph balancer show planN and also ceph osd crush weight-set dump.
Then I compared the values in the weight-set dump against the values in the two 
plans, and concluded that plan1 did not adjust the values for the ssd OSDs, which 
is exactly what I was looking for:

> ID  CLASS WEIGHTTYPE NAMESTATUS REWEIGHT PRI-AFF
> -1417.61093 host ceph00
>  24   ssd   1.76109 osd.24   up  1.0 1.0
>  25   ssd   1.76109 osd.25   up  1.0 1.0
>  26   ssd   1.76109 osd.26   up  1.0 1.0
>  27   ssd   1.76109 osd.27   up  1.0 1.0
>  28   ssd   1.76109 osd.28   up  1.0 1.0
>  29   ssd   1.76109 osd.29   up  1.0 1.0
>  30   ssd   1.76109 osd.30   up  1.0 1.0
>  31   ssd   1.76109 osd.31   up  1.0 1.0
>  32   ssd   1.76109 osd.32   up  1.0 1.0
>  33   ssd   1.76109 osd.33   up  1.0 1.0


ceph osd crush weight-set dump
>{
> "bucket_id": -14,
> "weight_set": [
> [
> 1.756317,
> 1.613647,
> 1.733200,
> 1.735825,
> 1.961304,
> 1.583069,
> 1.963791,
> 1.773041,
> 1.890228,
> 1.793457
> ]
> ]
> },


plan1 (no change)
> ceph osd crush weight-set reweight-compat 24 1.756317
> ceph osd crush weight-set reweight-compat 25 1.613647
> ceph osd crush weight-set reweight-compat 26 1.733200
> ceph osd crush weight-set reweight-compat 27 1.735825
> ceph osd crush weight-set reweight-compat 28 1.961304
> ceph osd crush weight-set reweight-compat 29 1.583069
> ceph osd crush weight-set reweight-compat 30 1.963791
> ceph osd crush weight-set reweight-compat 31 1.773041
> ceph osd crush weight-set reweight-compat 32 1.890228
> ceph osd crush weight-set reweight-compat 33 1.793457


plan2 (change)
> ceph osd crush weight-set reweight-compat 24 1.742185
> ceph osd crush weight-set reweight-compat 25 1.608330
> ceph osd crush weight-set reweight-compat 26 1.753393
> ceph osd crush weight-set reweight-compat 27 1.713531
> ceph osd crush weight-set reweight-compat 28 1.964446
> ceph osd crush weight-set reweight-compat 29 1.629001
> ceph osd crush weight-set reweight-compat 30 1.961968
> ceph osd crush weight-set reweight-compat 31 1.738253
> ceph osd crush weight-set reweight-compat 32 1.884098
> ceph osd crush weight-set reweight-compat 33 1.779180


Hopefully this will be helpful for someone else who overlooks this in the -h 
output.
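
For the record, the sequence was roughly (plan and pool names are from my setup;
adjust as needed):

ceph balancer optimize plan1 hdd
ceph balancer show plan1
ceph osd crush weight-set dump
ceph balancer execute plan1

with plan2 created the same way but without naming a pool, and execute only run
once you are happy with the proposed weights.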

Reed

> On Aug 1, 2018, at 6:05 PM, Reed Dier  wrote:
> 
> Hi Cephers,
> 
> I’m starting to play with the Ceph Balancer plugin after moving to straw2 and 
> running into something I’m surprised I haven’t seen posted here.
> 
> My cluster has two crush roots, one for HDD, one for SSD.
> 
> Right now, HDD’s are a single pool to themselves, SSD’s are a single pool to 
> themselves.
> 
> Using Ceph Balancer Eval, I can see the eval score for the hdd’s (worse), and 
> the ssd’s (better), and the blended score of the cluster overall.
> pool “hdd" score 0.012529 (lower is better)
> pool “ssd" score 0.004654 (lower is better)
> current cluster score 0.008484 (lower is better)
> 
> My problem is that I need to get my hdd’s better, and stop touching my ssd's, 
> because shuffling data wear’s the ssd's unnecessarily, and it has actually 
> gotten the distribution worse over time. https://imgur.com/RVh0jfH 
> 
> You can see that between 06:00 and 09:00 on the second day in the graph that 
> the spread was very tight, and then it expanded back.
> 
> So my question is, how can I run the balancer on just my hdd’s without 
> touching my ssd’s?
> 
> I removed about 15% of the PG’s living on the HDD’s because they were empty.
> I also have two tiers of HDD’s 8TB’s and 2TB’s, but they are roughly equally 
> weighted in crush at the chassis level where my failure domains are 
> configured.
> Hopefully this abbreviated ceph osd tree displays the hierarchy. Multipliers 
> for that bucket on right.
>> ID  CLASS WEIGHTTYPE NAME
>>  -1   218.49353 root default.hdd
>> -10   218.49353 rack default.rack-hdd
>> -7043.66553 chassis hdd-2tb-chassis1 *1
>> -6743.66553 host 

Re: [ceph-users] RGW problems after upgrade to Luminous

2018-08-03 Thread David Turner
I am currently unable to write any data to this bucket in this current
state.  Does anyone have any ideas for reverting to the original index
shards and cancel the reshard processes happening to the bucket?

On Thu, Aug 2, 2018 at 12:32 PM David Turner  wrote:

> I upgraded my last cluster to Luminous last night.  It had some very large
> bucket indexes on Jewel which caused a couple problems with the upgrade,
> but finally everything finished and we made it to the other side, but now
> I'm having problems with [1] these errors populating a lot of our RGW logs
> and clients seeing the time skew error responses.  The time stamps between
> the client nodes, rgw nodes, and the rest of the ceph cluster match
> perfectly and actually build off of the same ntp server.
>
> I tried disabling dynamic resharding for the RGW daemons by placing this
> in the ceph.conf for the affected daemons `rgw_dynamic_resharding = false`
> and restarting them as well as issuing a reshard cancel for the bucket, but
> nothing seems to actually stop the reshard from processing.  Here's the
> output of a few commands.  [2] reshard list [3] reshard status
>
> Are there any things we can do to actually disable bucket resharding or
> let it finish?  I'm stuck on ideas.  I've tried quite a few things I've
> found around except for manually resharding which is a last resort here.
> This bucket won't exist in a couple months and the performance is good
> enough without resharding, but I don't know how to get it to stop.  Thanks.
>
>
> [1] 2018-08-02 16:22:16.047387 7fbe82e61700  0 NOTICE: resharding
> operation on bucket index detected, blocking
> 2018-08-02 16:22:16.206950 7fbe8de77700  0 block_while_resharding ERROR:
> bucket is still resharding, please retry
> 2018-08-02 16:22:12.253734 7fbe4f5fa700  0 NOTICE: request time skew too
> big now=2018-08-02 16:22:12.00 req_time=2018-08-02 16:06:03.00
>
> [2] $ radosgw-admin reshard list
> [2018-08-02 16:13:19.082172 7f3ca4163c80 -1 ERROR: failed to list reshard
> log entries, oid=reshard.10
> 2018-08-02 16:13:19.082757 7f3ca4163c80 -1 ERROR: failed to list reshard
> log entries, oid=reshard.11
> 2018-08-02 16:13:19.083941 7f3ca4163c80 -1 ERROR: failed to list reshard
> log entries, oid=reshard.12
> 2018-08-02 16:13:19.085170 7f3ca4163c80 -1 ERROR: failed to list reshard
> log entries, oid=reshard.13
> 2018-08-02 16:13:19.085898 7f3ca4163c80 -1 ERROR: failed to list reshard
> log entries, oid=reshard.14
> ]
> 2018-08-02 16:13:19.086476 7f3ca4163c80 -1 ERROR: failed to list reshard
> log entries, oid=reshard.15
>
> [3] $ radosgw-admin reshard status --bucket my-bucket
> [
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> {
> "reshard_status": 1,
> "new_bucket_instance_id":
> "b7567cda-7d6f-4feb-86d6-bbd9da36b14d.141873449.1",
> "num_shards": 32
> },
> 

Re: [ceph-users] FileStore SSD (journal) vs BlueStore SSD (DB/Wal)

2018-08-03 Thread Sam Huracan
Hi,

Can anyone help us answer these questions?



2018-08-03 8:36 GMT+07:00 Sam Huracan :

> Hi Cephers,
>
> We intend to upgrade our Cluster from Jewel to Luminous (or Mimic?)
>
> Our model is currently using OSD File Store with SSD Journal (1 SSD for 7
> SATA 7.2K)
>
> My question are:
>
>
> 1.Should we change to BlueStore with DB/WAL put in SSD and data in HDD?
> (we want to keep the model using journal SSD for caching). Is there any
> improvement in overall performance? We think with model SSD cache,
> FileStore will write faster because data written in SSD before flushing to
> SAS, whereas with BlueStore, data will be written directly to SAS.
>
>
> 2. Do you guys ever benchmark and compare 2 cluster: FileStore SSD
> (journal)  and BlueStore SSD (DB/WAL) like that?
>
>
> Thanks in advance.
>
>
>


Re: [ceph-users] Error: journal specified but not allowed by osd backend

2018-08-03 Thread David Majchrzak
Thanks Eugen!
I was looking into running all the commands manually, following the docs for 
add/remove osd but tried ceph-disk first.

I actually made it work by changing the id part in ceph-disk (it was checking 
the wrong journal device, which was owned by root:root). The next part was 
that I tried re-using an old journal, so I had to create a new one (parted / 
sgdisk to set the ceph-journal parttype). Could I have just zapped the previous 
journal?
After that it prepared successfully and started peering. Unsetting nobackfill 
let it recover a 4TB HDD in approx 9 hours.
The best part was that, by reusing the OSD UUID, I didn't have to backfill twice.
I'll see if I can add to the docs after we have updated to Luminous or Mimic 
and started using ceph-volume.
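
For the record, recreating the journal partition looked roughly like this (the 
partition number matches the /dev/sda8 above, the size is only an example, and the 
typecode is the standard ceph-journal partition type GUID):

sgdisk --new=8:0:+20G --change-name=8:'ceph journal' \
    --typecode=8:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sda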

Kind Regards
David Majchrzak

On aug 3 2018, at 4:16 pm, Eugen Block  wrote:
>
> Hi,
> we have a full bluestore cluster and had to deal with read errors on
> the SSD for the block.db. Something like this helped us to recreate a
> pre-existing OSD without rebalancing, just refilling the PGs. I would
> zap the journal device and let it recreate. It's very similar to your
> ceph-deploy output, but maybe you get more of it if you run it manually:
>
> ceph-osd [--cluster-uuid <cluster-uuid>] [--osd-objectstore filestore]
> --mkfs -i <osd-id> --osd-journal <journal-device> --osd-data
> /var/lib/ceph/osd/ceph-<osd-id>/ --mkjournal --setuser ceph --setgroup
> ceph --osd-uuid <osd-uuid>
>
> Maybe after zapping the journal this will work. At least it would rule
> out the old journal as the show-stopper.
>
> Regards,
> Eugen
>
>
> Quoting David Majchrzak:
> > Hi!
> > Trying to replace an OSD on a Jewel cluster (filestore data on HDD +
> > journal device on SSD).
> > I've set noout and removed the flapping drive (read errors) and
> > replaced it with a new one.
> >
> > I've taken down the osd UUID to be able to prepare the new disk with
> > the same osd.ID. The journal device is the same as the previous one
> > (should I delete the partition and recreate it?)
> > However, running ceph-disk prepare returns:
> > # ceph-disk -v prepare --cluster-uuid
> > c51a2683-55dc-4634-9d9d-f0fec9a6f389 --osd-uuid
> > dc49691a-2950-4028-91ea-742ffc9ed63f --journal-dev --data-dev
> > --fs-type xfs /dev/sdo /dev/sda8
> > command: Running command: /usr/bin/ceph-osd --check-allows-journal
> > -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph
> > --setuser ceph --setgroup ceph
> > command: Running command: /usr/bin/ceph-osd --check-wants-journal -i
> > 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph
> > --setuser ceph --setgroup ceph
> > command: Running command: /usr/bin/ceph-osd --check-needs-journal -i
> > 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph
> > --setuser ceph --setgroup ceph
> > Traceback (most recent call last):
> > File "/usr/sbin/ceph-disk", line 9, in 
> > load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
> > File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5371, in run
> > main(sys.argv[1:])
> > File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5322, in 
> > main
> > args.func(args)
> > File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 1900, in 
> > main
> > Prepare.factory(args).prepare()
> > File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line
> > 1896, in factory
> > return PrepareFilestore(args)
> > File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line
> > 1909, in __init__
> > self.journal = PrepareJournal(args)
> > File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line
> > 2221, in __init__
> > raise Error('journal specified but not allowed by osd backend')
> > ceph_disk.main.Error: Error: journal specified but not allowed by osd 
> > backend
> >
> > I tried googling first of course. It COULD be that we have set
> > setuser_match_path globally in ceph.conf (like this bug report:
> > https://tracker.ceph.com/issues/19642) since the cluster was created
> > as dumpling a long time ago.
> > Best practice to fix it? Create [osd.X] configs and set
> > setuser_match_path in there instead for the old OSDs?
> > Should I do any other steps preceding this if I want to use the same
> > osd UUID? I've only stopped ceph-osd@21, removed the physical disk,
> > inserted new one and tried running prepare.
> > Kind Regards,
> > David
>
>
>
>


Re: [ceph-users] Error: journal specified but not allowed by osd backend

2018-08-03 Thread Eugen Block

Hi,

we have a full bluestore cluster and had to deal with read errors on  
the SSD for the block.db. Something like this helped us to recreate a  
pre-existing OSD without rebalancing, just refilling the PGs. I would  
zap the journal device and let it recreate. It's very similar to your  
ceph-deploy output, but maybe you get more of it if you run it manually:


ceph-osd [--cluster-uuid <cluster-uuid>] [--osd-objectstore filestore]  
--mkfs -i <osd-id> --osd-journal <journal-device> --osd-data  
/var/lib/ceph/osd/ceph-<osd-id>/ --mkjournal --setuser ceph --setgroup  
ceph --osd-uuid <osd-uuid>


Maybe after zapping the journal this will work. At least it would rule  
out the old journal as the show-stopper.


Regards,
Eugen


Quoting David Majchrzak:


Hi!
Trying to replace an OSD on a Jewel cluster (filestore data on HDD +  
journal device on SSD).
I've set noout and removed the flapping drive (read errors) and  
replaced it with a new one.


I've taken down the osd UUID to be able to prepare the new disk with  
the same osd.ID. The journal device is the same as the previous one  
(should I delete the partition and recreate it?)

However, running ceph-disk prepare returns:
# ceph-disk -v prepare --cluster-uuid  
c51a2683-55dc-4634-9d9d-f0fec9a6f389 --osd-uuid  
dc49691a-2950-4028-91ea-742ffc9ed63f --journal-dev --data-dev  
--fs-type xfs /dev/sdo /dev/sda8
command: Running command: /usr/bin/ceph-osd --check-allows-journal  
-i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph  
--setuser ceph --setgroup ceph
command: Running command: /usr/bin/ceph-osd --check-wants-journal -i  
0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph  
--setuser ceph --setgroup ceph
command: Running command: /usr/bin/ceph-osd --check-needs-journal -i  
0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph  
--setuser ceph --setgroup ceph

Traceback (most recent call last):
File "/usr/sbin/ceph-disk", line 9, in 
load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5371, in run
main(sys.argv[1:])
File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5322, in main
args.func(args)
File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 1900, in main
Prepare.factory(args).prepare()
File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line  
1896, in factory

return PrepareFilestore(args)
File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line  
1909, in __init__

self.journal = PrepareJournal(args)
File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line  
2221, in __init__

raise Error('journal specified but not allowed by osd backend')
ceph_disk.main.Error: Error: journal specified but not allowed by osd backend

I tried googling first of course. It COULD be that we have set  
setuser_match_path globally in ceph.conf (like this bug report:  
https://tracker.ceph.com/issues/19642) since the cluster was created  
as dumpling a long time ago.
Best practice to fix it? Create [osd.X] configs and set  
setuser_match_path in there instead for the old OSDs?
Should I do any other steps preceding this if I want to use the same  
osd UUID? I've only stopped ceph-osd@21, removed the physical disk,  
inserted new one and tried running prepare.

Kind Regards,
David






Re: [ceph-users] stuck with active+undersized+degraded on Jewel after cluster maintenance

2018-08-03 Thread Pawel S
On Fri, Aug 3, 2018 at 2:07 PM Paweł Sadowsk  wrote:

> On 08/03/2018 01:45 PM, Pawel S wrote:
> > hello!
> >
> > We did maintenance works (cluster shrinking) on one cluster (jewel)
> > and after shutting one of osds down we found this situation where
> > recover of pg can't be started because of "querying" one of peers. We
> > restarted this OSD, tried to out and in. Nothing helped, finally we
> > moved out data (the pg was still on it) and removed this osd from
> > crush and whole cluster. But recover can't start on any other osd to
> > create this copy again. We still have valid active 2 copies, but we
> > would like to have it clean.
> > How we can push recover to have this third copy somewhere ?
> > Replication size is 3 on hosts and there are plenty of them.
> >
> > Status now:
> >health HEALTH_WARN
> > 1 pgs degraded
> > 1 pgs stuck degraded
> > 1 pgs stuck unclean
> > 1 pgs stuck undersized
> > 1 pgs undersized
> > recovery 268/19265130 objects degraded (0.001%)
> >
> > Link to PG query details, health status and version commit here:
> > https://gist.github.com/pejotes/aea71ecd2718dbb3ceab0e648924d06b
> Can you add 'ceph osd tree', 'ceph osd crush show-tunables' and 'ceph
> osd crush rule dump'? Looks like crush is not able to find place for 3rd
> copy due to big difference in weight of rack/host depending on your
> crush rules.
>
>
yes, you were right :-)

I quickly went through the algorithm and found that it simply doesn't have enough
tries to handle this weight difference (I had 54, 115, 145) in my failure domains.
As a workaround, increasing "choose_total_tries" to 100 did the trick.
The rules were set to choose on datacenter buckets created from racks and
hosts. The next step will be to balance the weight of the datacenter buckets to
equalize them a bit; a couple of OSDs can then be removed. :-)
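
For anyone hitting the same thing, the change itself was along these lines 
(edit the decompiled map by hand and set "tunable choose_total_tries 100"):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new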
Thank you Pawel!

best regards!
Pawel


Re: [ceph-users] Ceph MDS and hard links

2018-08-03 Thread Yan, Zheng
On Fri, Aug 3, 2018 at 8:53 PM Benjeman Meekhof  wrote:
>
> Thanks, that's useful to know.  I've pasted the output you asked for
> below, thanks for taking a look.
>
> Here's the output of dump_mempools:
>
> {
> "mempool": {
> "by_pool": {
> "bloom_filter": {
> "items": 4806709,
> "bytes": 4806709
> },
> "bluestore_alloc": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_cache_data": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_cache_onode": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_cache_other": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_fsck": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_txc": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_writing_deferred": {
> "items": 0,
> "bytes": 0
> },
> "bluestore_writing": {
> "items": 0,
> "bytes": 0
> },
> "bluefs": {
> "items": 0,
> "bytes": 0
> },
> "buffer_anon": {
> "items": 1303621,
> "bytes": 6643324694
> },
> "buffer_meta": {
> "items": 2397,
> "bytes": 153408
> },
> "osd": {
> "items": 0,
> "bytes": 0
> },
> "osd_mapbl": {
> "items": 0,
> "bytes": 0
> },
> "osd_pglog": {
> "items": 0,
> "bytes": 0
> },
> "osdmap": {
> "items": 8222,
> "bytes": 185840
> },
> "osdmap_mapping": {
> "items": 0,
> "bytes": 0
> },
> "pgmap": {
> "items": 0,
> "bytes": 0
> },
> "mds_co": {
> "items": 160660321,
> "bytes": 4080240182
> },
> "unittest_1": {
> "items": 0,
> "bytes": 0
> },
> "unittest_2": {
> "items": 0,
> "bytes": 0
> }
> },
> "total": {
> "items": 166781270,
> "bytes": 10728710833
> }
> }
> }
>
> and heap_stats:
>
> MALLOC:12418630040 (11843.3 MiB) Bytes in use by application
> MALLOC: +  1310720 (1.2 MiB) Bytes in page heap freelist
> MALLOC: +378986760 (  361.4 MiB) Bytes in central cache freelist
> MALLOC: +  4713472 (4.5 MiB) Bytes in transfer cache freelist
> MALLOC: + 20722016 (   19.8 MiB) Bytes in thread cache freelists
> MALLOC: + 62652416 (   59.8 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =  12887015424 (12290.0 MiB) Actual memory used (physical + swap)
> MALLOC: +309624832 (  295.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =  13196640256 (12585.3 MiB) Virtual address space used
> MALLOC:
> MALLOC: 921411  Spans in use
> MALLOC: 20  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
>

The MDS does use 4 GB of memory for its cache, but there are 6 GB of memory used
by bufferlists. It seems like there is a memory leak. Could you try the
simple messenger (add "ms type = simple" to the 'global' section of
ceph.conf)?
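
That is, something like this on the MDS host, followed by an MDS restart:

[global]
        ms type = simple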


> On Wed, Aug 1, 2018 at 10:31 PM, Yan, Zheng  wrote:
> > On Thu, Aug 2, 2018 at 3:36 AM Benjeman Meekhof  wrote:
> >>
> >> I've been encountering lately a much higher than expected memory usage
> >> on our MDS which doesn't align with the cache_memory limit even
> >> accounting for potential over-runs.  Our memory limit is 4GB but the
> >> MDS process is steadily at around 11GB used.
> >>
> >> Coincidentally we also have a new user heavily relying on hard links.
> >> This led me to the following (old) document which says "Hard links are
> >> also supported, although in their current implementation each link
> >> requires a small bit of MDS memory and so there is an implied limit
> >> based on your available memory. "
> >> (https://ceph.com/geen-categorie/cephfs-mds-status-discussion/)
> >>
> >> Is that statement still correct, could it potentially explain why our
> >> memory usage appears so high?  As far as I know this is a recent
> >> development and it does very closely correspond to a new user doing a
> >> lot of hardlinking.  Ceph Mimic 13.2.1, though we first saw the issue
> >> while still running 13.2.0.
> >>
> >
> > That statement is no longer correct.   what are 

Re: [ceph-users] Ceph MDS and hard links

2018-08-03 Thread Benjeman Meekhof
Thanks, that's useful to know.  I've pasted the output you asked for
below, thanks for taking a look.

Here's the output of dump_mempools:

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 4806709,
"bytes": 4806709
},
"bluestore_alloc": {
"items": 0,
"bytes": 0
},
"bluestore_cache_data": {
"items": 0,
"bytes": 0
},
"bluestore_cache_onode": {
"items": 0,
"bytes": 0
},
"bluestore_cache_other": {
"items": 0,
"bytes": 0
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 0,
"bytes": 0
},
"bluestore_writing_deferred": {
"items": 0,
"bytes": 0
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 0,
"bytes": 0
},
"buffer_anon": {
"items": 1303621,
"bytes": 6643324694
},
"buffer_meta": {
"items": 2397,
"bytes": 153408
},
"osd": {
"items": 0,
"bytes": 0
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 0,
"bytes": 0
},
"osdmap": {
"items": 8222,
"bytes": 185840
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 160660321,
"bytes": 4080240182
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 166781270,
"bytes": 10728710833
}
}
}

and heap_stats:

MALLOC:12418630040 (11843.3 MiB) Bytes in use by application
MALLOC: +  1310720 (1.2 MiB) Bytes in page heap freelist
MALLOC: +378986760 (  361.4 MiB) Bytes in central cache freelist
MALLOC: +  4713472 (4.5 MiB) Bytes in transfer cache freelist
MALLOC: + 20722016 (   19.8 MiB) Bytes in thread cache freelists
MALLOC: + 62652416 (   59.8 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =  12887015424 (12290.0 MiB) Actual memory used (physical + swap)
MALLOC: +309624832 (  295.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =  13196640256 (12585.3 MiB) Virtual address space used
MALLOC:
MALLOC: 921411  Spans in use
MALLOC: 20  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

On Wed, Aug 1, 2018 at 10:31 PM, Yan, Zheng  wrote:
> On Thu, Aug 2, 2018 at 3:36 AM Benjeman Meekhof  wrote:
>>
>> I've been encountering lately a much higher than expected memory usage
>> on our MDS which doesn't align with the cache_memory limit even
>> accounting for potential over-runs.  Our memory limit is 4GB but the
>> MDS process is steadily at around 11GB used.
>>
>> Coincidentally we also have a new user heavily relying on hard links.
>> This led me to the following (old) document which says "Hard links are
>> also supported, although in their current implementation each link
>> requires a small bit of MDS memory and so there is an implied limit
>> based on your available memory. "
>> (https://ceph.com/geen-categorie/cephfs-mds-status-discussion/)
>>
>> Is that statement still correct, could it potentially explain why our
>> memory usage appears so high?  As far as I know this is a recent
>> development and it does very closely correspond to a new user doing a
>> lot of hardlinking.  Ceph Mimic 13.2.1, though we first saw the issue
>> while still running 13.2.0.
>>
>
> That statement is no longer correct.   what are output of  "ceph
> daemon mds.x dump_mempools" and "ceph tell mds.x heap stats"?
>
>
>> thanks,
>> Ben


Re: [ceph-users] stuck with active+undersized+degraded on Jewel after cluster maintenance

2018-08-03 Thread Paweł Sadowsk
On 08/03/2018 01:45 PM, Pawel S wrote:
> hello!
>
> We did maintenance works (cluster shrinking) on one cluster (jewel)
> and after shutting one of osds down we found this situation where
> recover of pg can't be started because of "querying" one of peers. We
> restarted this OSD, tried to out and in. Nothing helped, finally we
> moved out data (the pg was still on it) and removed this osd from
> crush and whole cluster. But recover can't start on any other osd to
> create this copy again. We still have valid active 2 copies, but we
> would like to have it clean. 
> How we can push recover to have this third copy somewhere ?
> Replication size is 3 on hosts and there are plenty of them.  
>
> Status now: 
>    health HEALTH_WARN
>             1 pgs degraded
>             1 pgs stuck degraded
>             1 pgs stuck unclean
>             1 pgs stuck undersized
>             1 pgs undersized
>             recovery 268/19265130 objects degraded (0.001%)
>
> Link to PG query details, health status and version commit here:
> https://gist.github.com/pejotes/aea71ecd2718dbb3ceab0e648924d06b
Can you add 'ceph osd tree', 'ceph osd crush show-tunables' and 'ceph
osd crush rule dump'? It looks like CRUSH is not able to find a place for the 3rd
copy due to a big difference in the weight of rack/host, depending on your
crush rules.

-- 
PS


[ceph-users] stuck with active+undersized+degraded on Jewel after cluster maintenance

2018-08-03 Thread Pawel S
hello!

We did maintenance work (cluster shrinking) on one cluster (Jewel), and after
shutting one of the OSDs down we found this situation where recovery of a PG
can't be started because it is "querying" one of its peers. We restarted this OSD
and tried marking it out and in. Nothing helped, so finally we moved the data off
(the PG was still on it) and removed this OSD from the crush map and the whole
cluster. But recovery can't start on any other OSD to create this copy again. We
still have 2 valid active copies, but we would like to have it clean.
How can we push recovery to create this third copy somewhere? The replication
size is 3 across hosts and there are plenty of them.

Status now:
   health HEALTH_WARN
1 pgs degraded
1 pgs stuck degraded
1 pgs stuck unclean
1 pgs stuck undersized
1 pgs undersized
recovery 268/19265130 objects degraded (0.001%)

Link to PG query details, health status and version commit here:
https://gist.github.com/pejotes/aea71ecd2718dbb3ceab0e648924d06b

best regards!
-- 
Pawel


Re: [ceph-users] Cephfs meta data pool to ssd and measuring performance difference

2018-08-03 Thread Linh Vu
Try IOR mdtest for metadata performance.
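
A minimal run might look something like this (mount point, process count and
file counts are only examples):

mpirun -np 8 mdtest -n 1000 -i 3 -d /mnt/cephfs/mdtest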


From: ceph-users  on behalf of Marc Roos 

Sent: Friday, 3 August 2018 7:49:13 PM
To: dcsysengineer
Cc: ceph-users
Subject: Re: [ceph-users] Cephfs meta data pool to ssd and measuring 
performance difference


I have moved the pool, but the strange thing is that if I do something like
this:

for object in `cat out`; do rados -p fs_meta get $object /dev/null ;
done
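(Presumably the object list 'out' used above comes from a plain pool listing, and the placement of a single object can be spot-checked to confirm which OSDs should see the reads; the object name below is just a placeholder:)

rados -p fs_meta ls > out
ceph osd map fs_meta <some-object-from-out>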

I do not see any activity on the SSD drives with something like dstat
(checked on all nodes; the SSDs are sdh).

[dstat output: net/eth4.60, net/eth4.52 and dsk/sda..dsk/sdi read/write
columns; the --dsk/sdh-- (SSD) column shows essentially no read activity
while the loop runs. The full output is preserved in Marc Roos' original
message later in this digest.]

But this showed some activity

[@c01 ~]# ceph osd pool stats | grep fs_meta -A 2
pool fs_meta id 19
client io 0 B/s rd, 17 op/s rd, 0 op/s wr

It took maybe around 20h to move the fs_meta pool (only 200MB, 2483328
objects) from HDD to SSD, possibly also because of some other remapping
from one replaced and one added HDD. (I have slow HDDs.)

I did not manage to do a good test, because the results seem to be
similar to those from before the move. I did not want to create files
because I thought that would involve the fs_data pool too much, which is
on my slow HDDs. So I ran the readdir and stat tests.

I checked that mds.a was active and limited the cache of mds.a to 1000
inodes (I think) with:
ceph daemon mds.a config set mds_cache_size 1000

Flushed caches on the nodes with:
free && sync && echo 3 > /proc/sys/vm/drop_caches && free

And ran these tests:
python ../smallfile-master/smallfile_cli.py --operation stat --threads 1
--file-size 128 --files-per-dir 5 --files 50 --top
/home/backup/test/kernel/
python ../smallfile-master/smallfile_cli.py --operation readdir
--threads 1 --file-size 128 --files-per-dir 5 --files 50 --top
/home/backup/test/kernel/

Maybe this is helpful in selecting a better test for your move.


-Original Message-
From: David C [mailto:dcsysengin...@gmail.com]
Sent: maandag 30 juli 2018 14:23
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Cephfs meta data pool to ssd and measuring
performance difference

Something like smallfile perhaps? https://github.com/bengland2/smallfile

Or you just time creating/reading lots of files

With read benching you would want to ensure you've cleared your mds
cache or use a dataset larger than the cache.
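One rough way to do that between runs is what Marc describes above in this thread (the values here are only examples):

# shrink the MDS cache so very little metadata stays cached
ceph daemon mds.a config set mds_cache_size 1000
# drop the page cache on the client node before each read test
free && sync && echo 3 > /proc/sys/vm/drop_caches && free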

I'd be interested in seeing your results, I have this on the to-do list
myself.

On 25 Jul 2018 15:18, "Marc Roos"  wrote:




From this thread, I got how to move the metadata pool from the HDDs to
the SSDs:
https://www.spinics.net/lists/ceph-users/msg39498.html

ceph osd pool get fs_meta crush_rule
ceph osd pool set fs_meta crush_rule replicated_ruleset_ssd

I guess this can be done on a live system?

What would be a good test to show the performance difference
between the
old hdd and the new ssd?
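A minimal sanity check around that switch, assuming the SSD rule already exists, might look like this, and the resulting data migration can then be watched from the pool stats:

ceph osd crush rule ls                    # confirm the SSD rule exists
ceph osd pool set fs_meta crush_rule replicated_ruleset_ssd
ceph osd pool get fs_meta crush_rule      # verify it took effect
ceph osd pool stats fs_meta               # watch the resulting backfill
ceph -s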


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault

2018-08-03 Thread Alexandru Cucu
Hello,

Another OSD started randomly crashing with a segmentation fault. I haven't
managed to add the last 3 OSDs back to the cluster, as the daemons keep
crashing.

---

-2> 2018-08-03 12:12:52.670076 7f12b6b15700  4 rocksdb:
EVENT_LOG_v1 {"time_micros": 1533287572670073, "job": 3, "event":
"table_file_deletion", "file_number": 4350}
-1> 2018-08-03 12:12:53.146753 7f12c38d0a80  0 osd.154 89917 load_pgs
 0> 2018-08-03 12:12:57.526910 7f12c38d0a80 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f12c38d0a80 thread_name:ceph-osd
 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f12c42ddc2a]
 2: (()+0xf5e0) [0x7f12c1dc85e0]
 3: (()+0x34484) [0x7f12c34a6484]
 4: (rocksdb::BlockBasedTable::NewIndexIterator(rocksdb::ReadOptions
const&, rocksdb::BlockIter*,
rocksdb::BlockBasedTable::CachableEntry*)+0x466)
[0x7f12c41e40d6]
 5: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x297)
[0x7f12c41e4b27]
 6: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&,
rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*,
bool
, int)+0x2a4) [0x7f12c429ff94]
 7: (rocksdb::Version::Get(rocksdb::ReadOptions const&,
rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*,
rocksdb::MergeContext*, rocksdb::RangeDelAggregator*, bool*, bool*,
unsigned l
ong*)+0x810) [0x7f12c419bb80]
 8: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
rocksdb::PinnableSlice*, bool*)+0x5a4) [0x7f12c424e494]
 9: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
rocksdb::PinnableSlice*)+0x19) [0x7f12c424ea19]
 10: (rocksdb::DB::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
std::string*)+0x95) [0x7f12c4252a45]
 11: (rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::Slice
const&, std::string*)+0x4a) [0x7f12c4251eea]
 12: (RocksDBStore::get(std::string const&, std::string const&,
ceph::buffer::list*)+0xff) [0x7f12c415c31f]
 13: (DBObjectMap::_lookup_map_header(DBObjectMap::MapHeaderLock
const&, ghobject_t const&)+0x5e4) [0x7f12c4110814]
 14: (DBObjectMap::get_values(ghobject_t const&, std::set, std::allocator > const&,
std::map,
std::
allocator > >*)+0x5f)
[0x7f12c41f]
 15: (FileStore::omap_get_values(coll_t const&, ghobject_t const&,
std::set,
std::allocator > const&, std::map, std::allocator > >*)+0x197) [0x7f12c4031f77]
 16: (PG::_has_removal_flag(ObjectStore*, spg_t)+0x151) [0x7f12c3d8f7c1]
 17: (OSD::load_pgs()+0x5d5) [0x7f12c3cf43e5]
 18: (OSD::init()+0x2086) [0x7f12c3d07096]
 19: (main()+0x2c18) [0x7f12c3c1e088]
 20: (__libc_start_main()+0xf5) [0x7f12c0374c05]
 21: (()+0x3c8847) [0x7f12c3cb4847]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
---
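In case it helps capture more context before the next crash, one option is to raise the relevant debug levels in ceph.conf before restarting the affected OSD; the option names below are the standard debug settings, the levels are only a suggestion:

[osd]
debug osd = 20
debug filestore = 20
debug rocksdb = 5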

Any help would be appreciated.

Thanks,
Alex Cucu

On Mon, Jul 30, 2018 at 4:55 PM Alexandru Cucu  wrote:
>
> Hello Ceph users,
>
> We have updated our cluster from 10.2.7 to 10.2.11. A few hours after
> the update, 1 OSD crashed.
> When trying to add that OSD back to the cluster, 2 other OSDs started
> crashing with segmentation faults. We had to mark all 3 OSDs as down, as
> we had stuck PGs and blocked operations and the cluster status was
> HEALTH_ERR.
>
> We have tried various ways to re-add the OSDs to the cluster, but
> after a while they start crashing and won't start anymore. After some
> time they can be started again and marked in, but after some
> rebalancing they start crashing immediately after starting.
>
> Here are some logs:
> https://pastebin.com/nCRamgRU
>
> Do you know of any existing bug report that might be related? (I
> couldn't find anything).
>
> I will happily provide any information that would help solving this issue.
>
> Thank you,
> Alex Cucu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs meta data pool to ssd and measuring performance difference

2018-08-03 Thread Marc Roos
 
I have moved the pool, but the strange thing is that if I do something like
this:

for object in `cat out`; do rados -p fs_meta get $object /dev/null  ; 
done

I do not see any activity on the SSD drives with something like dstat
(checked on all nodes; the SSDs are sdh).

net/eth4.60-net/eth4.52 
--dsk/sda-dsk/sdb-dsk/sdc-dsk/sdd-dsk/sde-dsk/sdf---
--dsk/sdg-dsk/sdh-dsk/sdi--
 recv  send: recv  send| read  writ: read  writ: read  writ: read  writ: 
read  writ: read  writ: read  writ: read  writ: read  writ
   0 0 :   0 0 | 415B  201k:3264k 1358k:4708k 3487k:5779k 3046k: 
 13M 6886k:9055B  487k:1118k  836k: 327k  210k:3497k 1569k
 154k  154k: 242k  336k|   0 0 :   0 0 :   012k:   0 0 : 
  0 0 :   0 0 :   0 0 :   0 0 :   0 0
 103k   92k: 147k  199k|   0   164k:   0 0 :   032k:   0 0 
:8192B   12k:   0 0 :   0 0 :   0 0 :   0 0
  96k  124k: 108k   96k|   0  4096B:   0 0 :4096B   20k:   0 0 
:4096B   12k:   0  8192B:   0 0 :   0 0 :   0 0
 175k  375k: 330k  266k|   069k:   0 0 :   0 0 :   0 0 : 
  0 0 :8192B  136k:   0 0 :   0 0 :   0 0
 133k  102k: 124k  103k|   0 0 :   0 0 :   076k:   0 0 : 
  032k:   0 0 :   0 0 :   0 0 :   0 0
 350k  185k: 318k 1721k|   057k:   0 0 :   016k:   0 0 : 
  036k:1416k0 :   0 0 :   0   144k:   0 0
 206k  135k: 164k  797k|   0 0 :   0 0 :8192B   44k:   0 0 : 
  028k: 660k0 :   0 0 :4096B  260k:   0 0
 138k  136k: 252k  273k|   051k:   0 0 :4096B   16k:   0 0 : 
  0 0 :   0 0 :   0 0 :   0 0 :   0 0
 158k  117k: 436k  369k|   0 0 :   0 0 :   0 0 :   020k: 
  0 0 :4096B   20k:   020k:   0 0 :   0 0
 146k  106k: 327k  988k|   063k:   016k:   052k:   0 0 : 
  0 0 :   052k:   0 0 :   0 0 :   0 0
  77k   74k: 361k  145k|   0 0 :   0 0 :   016k:   0 0 : 
  0 0 :   0 0 :   0 0 :   0 0 :   0 0
 186k  149k: 417k  824k|   051k:   0 0 :   028k:   0 0 : 
  028k:   0 0 :   0 0 :   036k:   0 0 

But this showed some activity

[@c01 ~]# ceph osd pool stats | grep fs_meta -A 2
pool fs_meta id 19
  client io 0 B/s rd, 17 op/s rd, 0 op/s wr

It took maybe around 20h to move the fs_meta pool (only 200MB, 2483328
objects) from HDD to SSD, possibly also because of some other remapping
from one replaced and one added HDD. (I have slow HDDs.)

I did not manage to do a good test, because the results seem to be
similar to those from before the move. I did not want to create files
because I thought that would involve the fs_data pool too much, which is
on my slow HDDs. So I ran the readdir and stat tests.

I checked that mds.a was active and limited the cache of mds.a to 1000
inodes (I think) with:
ceph daemon mds.a config set mds_cache_size 1000

Flushed caches on the nodes with: 
free && sync && echo 3 > /proc/sys/vm/drop_caches && free

And ran these tests:
python ../smallfile-master/smallfile_cli.py --operation stat --threads 1 
--file-size 128 --files-per-dir 5 --files 50 --top 
/home/backup/test/kernel/
python ../smallfile-master/smallfile_cli.py --operation readdir 
--threads 1 --file-size 128 --files-per-dir 5 --files 50 --top 
/home/backup/test/kernel/

Maybe this is helpful in selecting a better test for your move.


-Original Message-
From: David C [mailto:dcsysengin...@gmail.com] 
Sent: maandag 30 juli 2018 14:23
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Cephfs meta data pool to ssd and measuring 
performance difference

Something like smallfile perhaps? https://github.com/bengland2/smallfile

Or you just time creating/reading lots of files

With read benching you would want to ensure you've cleared your mds 
cache or use a dataset larger than the cache.

I'd be interested in seeing your results, I have this on the to-do list
myself.

On 25 Jul 2018 15:18, "Marc Roos"  wrote:


 

From this thread, I got how to move the metadata pool from the HDDs to
the SSDs:
https://www.spinics.net/lists/ceph-users/msg39498.html

ceph osd pool get fs_meta crush_rule
ceph osd pool set fs_meta crush_rule replicated_ruleset_ssd

I guess this can be done on a live system?

What would be a good test to show the performance difference 
between the 
old hdd and the new ssd?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange OSD crash starts other osd flapping

2018-08-03 Thread Daznis
Hello,


Yesterday I encountered a strange OSD crash which led to cluster-wide
flapping. I had to set the nodown flag on the cluster to stop the
flapping. The first OSD crashed with:

2018-08-02 17:23:23.275417 7f87ec8d7700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f8803dfb700' had timed out after 15
2018-08-02 17:23:23.275425 7f87ec8d7700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f8805dff700' had timed out after 15

2018-08-02 17:25:38.902142 7f8829df0700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f8803dfb700' had suicide timed out after 150
2018-08-02 17:25:38.907199 7f8829df0700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(const
ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f8829df0700
time 2018-08-02 17:25:38.902354
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x55872911fb65]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
const*, long)+0x2e1) [0x55872905e8f1]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x55872905f14e]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x55872905f92c]
 5: (CephContextServiceThread::entry()+0x15b) [0x55872913790b]
 6: (()+0x7e25) [0x7f882dc71e25]
 7: (clone()+0x6d) [0x7f882c2f8bad]


Then other osds started restarting with messages like this:

2018-08-02 17:37:14.859272 7f4bd31fe700  0 osd.44 184343
_committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last
600.00 seconds, shutting down
2018-08-02 17:37:14.870121 7f4bd31fe700  0 osd.44 184343
_committed_osd_maps shutdown OSD via async signal
2018-08-02 17:37:14.870159 7f4bb9618700 -1 osd.44 184343 *** Got
signal Interrupt ***
2018-08-02 17:37:14.870167 7f4bb9618700  0 osd.44 184343
prepare_to_stop starting shutdown

There is a 10k-line event dump with the first OSD crash. I have looked
through it and nothing strange stood out to me. Any suggestions on what I
should be looking for in it? I have checked the nodes' dmesg and the
switch port logs: no sign of flapping ports or interfaces, and no disk
errors at all.
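For reference, the flags mentioned above are set and cleared like this (noout is often set alongside nodown while investigating; remember to unset both afterwards):

ceph osd set nodown
ceph osd set noout
# ... investigate / restart daemons ...
ceph osd unset nodown
ceph osd unset noout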
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write operation to cephFS mount hangs

2018-08-03 Thread Eugen Block

Hi,

I am sending the logfile as an attachment. I can find no error messages
or anything problematic…


I didn't see any log file attached to the email.

Another question: Is there a link between the VMs that fail to write  
to CephFS and the hypervisors? Are all failing clients on the same  
hypervisor(s)? If yes, have there been updates or any configuration  
changes?


Regards
Eugen


Quoting Bödefeld Sabine:


Hello Gregory



Yes I did that…

I’ve started the ceph-fuse client manually with ceph-fuse -d  
--debug-client=20 --debug-ms=1 --debug-monc=20 -m 192.168.1.161:6789  
/mnt/cephfs


and then started the copy process. I am sending the logfile as an
attachment. I can find no error messages or anything problematic…




Kind regards

Sabine



  _

Dr. sc. techn. ETH
Sabine Bödefeld
Senior Consultant

ECOSPEED AG
Drahtzugstrasse 18
CH-8008 Zürich
Tel. +41 44 388 95 04
Fax: +41 44 388 95 09
Web: http://www.ecospeed.ch 




From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: Thursday, 2 August 2018 09:36
To: Bödefeld Sabine
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Write operation to cephFS mount hangs



Make sure you’ve gone through the suggestions at  
http://docs.ceph.com/docs/master/cephfs/troubleshooting/


On Thu, Aug 2, 2018 at 12:39 PM Bödefeld Sabine  
 wrote:


Hi Gregory



Yes, they have the same key and permissions. 'ceph auth list' on the
MDS server gives:




client.admin

   key: 

caps: [mds] allow *

caps: [mon] allow *

caps: [osd] allow *



On the client /etc/ceph/ceph.client.admin.keyring

[client.admin]

key = 

caps mds = "allow *"

caps mon = "allow *"

caps osd = "allow *"



I checked that the key is identical, also that the keyring is the  
same on both the clients that work and on the clients where the  
write operations fail.


Do you have any other suggestions?
Kind regards

Sabine




  _


Dr. sc. techn. ETH
Sabine Bödefeld
Senior Consultant

ECOSPEED AG
Drahtzugstrasse 18  

CH-8008 Zürich  


Tel. +41 44 388 95 04
Fax: +41 44 388 95 09
Web: http://www.ecospeed.ch 




From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: Wednesday, 1 August 2018 06:10
To: Bödefeld Sabine
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Write operation to cephFS mount hangs





On Tue, Jul 31, 2018 at 7:46 PM Bödefeld Sabine  
 wrote:


Hello,



we have a Ceph Cluster 10.2.10 on VMs with Ubuntu 16.04 using Xen as  
the hypervisor. We use CephFS and the clients use ceph-fuse to  
access the files.


Some of the ceph-fuse clients hang on write operations to the  
cephFS. On copying a file to the cephFS, the file is created but  
it's empty and the write operation hangs forever. Ceph-fuse version  
is 10.2.9.




Sounds like the client has the MDS permissions required to update  
the CephFS metadata hierarchy, but lacks permission to write to the  
RADOS pools which actually store the file data. What permissions do  
the clients have? Have you checked with "ceph auth list" or similar  
to make sure they all have the same CephX capabilities?


-Greg
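As a point of comparison, a CephFS client key that is allowed to write file data normally needs rw on the data pool in addition to the MDS caps, e.g. something along these lines (the client name and pool name are only placeholders):

ceph auth get-or-create client.cephfs \
  mds 'allow rw' \
  mon 'allow r' \
  osd 'allow rw pool=cephfs_data'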



In the logfile of the mds there are no error messages. Also, ceph  
health returns HEALTH_OK.


ceph daemon mds.eco61 session ls reports no problems (if I interpret  
correctly):


   {

"id": 64396,

"num_leases": 2,

"num_caps": 32,

"state": "open",

"replay_requests": 0,

"completed_requests": 1,

"reconnecting": false,

"inst": "client.64396 192.168.1.179:0\/980852091",

"client_metadata": {

"ceph_sha1": "2ee413f77150c0f375ff6f10edd6c8f9c7d060d0",

"ceph_version": "ceph version 10.2.9  
(2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)",


"entity_id": "admin",

"hostname": "eco79",

"mount_point": "\/mnt\/cephfs",

"root": "\/"

}

},



Does anyone have an idea where the problem lies? Any help would be  
greatly appreciated.


Thanks very much,

Kind regards

Sabine

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] download.ceph.com repository changes

2018-08-03 Thread Fabian Grünbichler
On Mon, Jul 30, 2018 at 11:36:55AM -0600, Ken Dreyer wrote:
> On Fri, Jul 27, 2018 at 1:28 AM, Fabian Grünbichler
>  wrote:
> > On Tue, Jul 24, 2018 at 10:38:43AM -0400, Alfredo Deza wrote:
> >> Hi all,
> >>
> >> After the 12.2.6 release went out, we've been thinking of better ways
> >> to remove a version from our repositories to prevent users from
> >> upgrading/installing a known bad release.
> >>
> >> The way our repos are structured today means every single version of
> >> the release is included in the repository. That is, for Luminous,
> >> every 12.x.x version of the binaries is in the same repo. This is true
> >> for both RPM and DEB repositories.
> >>
> >> However, the DEB repos don't allow pinning to a given version because
> >> our tooling (namely reprepro) doesn't construct the repositories in a
> >> way that allows this. For RPM repos this is fine, and version
> >> pinning works.
> >
> > If you mean that reprepro does not support referencing multiple versions
> > of packages in the Packages file, there is a patched fork that does
> > (and it seems well-supported):
> >
> > https://github.com/profitbricks/reprepro
> 
> Thanks for this link. That's great to know someone's working on this.
> 
> What's the status of merging that back into the main reprepro code, or
> else shipping that fork as the new reprepro package in Debian /
> Ubuntu? The Ceph project could end up responsible for maintaining that
> reprepro fork if the main Ubuntu community does not pick it up :) The
> fork is several years old, and the latest update on
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=570623 was over a
> year ago.

I don't know anything more than what is publicly available about either
merging back into the original reprepro or shipping it in Debian/Ubuntu. We
are using our own custom repo software built around lower-level tools; I
was just aware of the fork for unrelated reasons :)
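For what it's worth, once multiple versions are actually exposed in the DEB Packages index, holding a Debian/Ubuntu node on a known-good release would be a plain apt preferences pin along these lines (the version string is only an example):

# /etc/apt/preferences.d/ceph.pref
Package: ceph*
Pin: version 12.2.5*
Pin-Priority: 1001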

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com