Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs

2018-03-08 Thread shadow_lin
Thanks for your advice.
I will try to reweight the OSDs of my cluster.

Why is Ceph so sensitive to unbalanced PG distribution during high load? The ceph 
osd df result is: https://pastebin.com/ur4Q9jsA.  The ceph osd perf result is: 
https://pastebin.com/87DitPhV

There is no OSD with a very high PG count compared to the others. When the write 
test load is low everything seems fine, but during the high write load test, some of 
the OSDs with more PGs can have 3-10 times the fs_apply_latency of the others. 

My guess is that the heavily loaded OSDs slow the whole cluster (because I have 
only one pool spanning all OSDs) down to the level they can handle, so the other 
OSDs see a lower load and have good latency.

Is this expected during high load (indicating that the load is too high for the 
current cluster to handle)? 

How does Luminous solve the uneven PG distribution problem? I read that there 
is a pg-upmap exception table in the osdmap in Luminous 12.2.x. It is said that 
with it you can achieve a perfect PG distribution among OSDs.
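For anyone else reading along, a rough sketch of how that is driven in Luminous 
12.2.x (the commands below are from memory and the PG/OSD ids are made up, so 
please verify against the docs before running anything):

ceph osd set-require-min-compat-client luminous   # upmap needs Luminous-or-newer clients
ceph mgr module enable balancer                    # may already be enabled
ceph balancer mode upmap
ceph balancer on
# or remap a single PG by hand, e.g. move PG 1.7 from osd.3 to osd.10:
ceph osd pg-upmap-items 1.7 3 10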

2018-03-09 

shadow_lin 



From: David Turner 
Sent: 2018-03-09 06:45
Subject: Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds 
with more pgs
To: "shadow_lin"
Cc: "ceph-users"

PGs being unevenly distributed is a common occurrence in Ceph.  Luminous 
started making some steps towards correcting this, but you're in Jewel.  There 
are a lot of threads in the ML archives about fixing PG distribution.  
Generally every method comes down to increasing the weight on OSDs with too few 
PGs and decreasing the weight on the OSDs with too many PGs.  There are a lot 
of schools of thought on the best way to implement this in your environment 
which has everything to do with your client IO patterns and workloads.  Looking 
into `ceph osd reweight-by-pg` might be a good place for you to start as you 
are only looking at 1 pool in your cluster.  If you have more pools, you 
generally need `ceph osd reweight-by-utilization`.


On Wed, Mar 7, 2018 at 8:19 AM shadow_lin  wrote:

Hi list,
   Ceph version is Jewel 10.2.10 and all OSDs are using FileStore.
The cluster has 96 OSDs and 1 pool with size=2 replication and 4096 PGs (based on 
the PG calculation method from the Ceph docs, targeting about 100 PGs per OSD).
The OSD with the most PGs has 104, and there are 6 OSDs with more than 100 PGs.
Most of the OSDs have around 70-90 PGs.
The OSD with the fewest PGs has 58.

During the write test some of the OSDs have a very high fs_apply_latency, around 
1000-4000 ms, while the normal ones are around 100-600 ms. The OSDs with high 
latency are always the ones with more PGs on them. 

iostat on the high-latency OSDs shows the HDDs at about 95%-96% %util, while the 
normal ones are at 40%-60% %util.

I think the reason is that the OSDs with more PGs have to handle more write 
requests. Is this right?
But even though the PG distribution is not even, the variation is not that 
large. How can the performance be so sensitive to it?

Is there anything I can do to improve the performance and reduce the latency?

How can I make the pg distribution to be more even?

Thanks


2018-03-07



shadowlin



Re: [ceph-users] Problem with UID starting with underscores

2018-03-08 Thread Konstantin Shalygin

Because one of our scripts misbehaved, a new user with a bad UID was created via the
API, and now we can't remove, view or modify it. I believe it's because it
has three underscores at the beginning:



Same problem here 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024578.html


You should search for the issue on the tracker, or create one.




k



Re: [ceph-users] /var/lib/ceph/osd/ceph-xxx/current/meta shows "Structure needs cleaning"

2018-03-08 Thread 赵贺东
Thank you for your suggestions.
We will upgrade the Ubuntu distro and Linux kernel to see if the problem still 
exists or not.
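In case it is useful to anyone else hitting this, the basic check sequence Brad 
suggests below would look roughly like this (a sketch only; the OSD id N and the 
device /dev/sda1 are placeholders, adjust to your setup):

systemctl stop ceph-osd@N                 # stop the OSD using the filesystem
umount /var/lib/ceph/osd/ceph-N           # unmount the data partition
xfs_repair -n /dev/sda1                   # -n = check only, report what would be fixed
dmesg | grep -i xfs                       # look for the corruption messages shown below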

> On Mar 8, 2018, at 5:51 PM, Brad Hubbard wrote:
> 
> > On Thu, Mar 8, 2018 at 7:33 PM, 赵贺东 wrote:
>> Hi Brad,
>> 
>> Thank you for your attention.
>> 
>>> On Mar 8, 2018, at 4:47 PM, Brad Hubbard wrote:
>>> 
>>> On Thu, Mar 8, 2018 at 5:01 PM, 赵贺东  wrote:
 Hi All,
 
 Every time after we activate an OSD, we get “Structure needs cleaning” in 
 /var/lib/ceph/osd/ceph-xxx/current/meta.
 
 
 /var/lib/ceph/osd/ceph-xxx/current/meta
 # ls -l
 ls: reading directory .: Structure needs cleaning
 total 0
 
 Could Anyone say something about this error?
>>> 
>>> It's an indication of possible corruption on the filesystem containing 
>>> "meta".
>>> 
>>> Can you unmount it and run a filesystem check on it?
>> I did some xfs_repair operations, but with no effect. “Structure needs cleaning” 
>> still exists.
>> 
>> 
>> 
>>> 
>>> At the time the filesystem first detected the corruption it would have
>>> logged it to dmesg and possibly syslog which may give you a clue. Did
>>> you lose power or have a kernel panic or something?
>> We did not lose power.
>> You are right, we get a metadata corruption message in dmesg every time, just 
>> after the OSD activation operation.
>> 
>> [  399.513525] XFS (sda1): Metadata corruption detected at 
>> xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
>> [  399.524709] XFS (sda1): Unmount and run xfs_repair
>> [  399.529511] XFS (sda1): First 64 bytes of corrupted metadata buffer:
>> [  399.535917] dd8f2000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
>> XFSB.s..
>> [  399.543959] dd8f2010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> [  399.551983] dd8f2020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
>> .0@"Q.O..sV.q..$
>> [  399.560037] dd8f2030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  
>> 
>> [  399.568118] XFS (sda1): metadata I/O error: block 0x48b9ff80 
>> ("xfs_trans_read_buf_map") error 117 numblks 8
>> [  399.583179] XFS (sda1): Metadata corruption detected at 
>> xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
>> [  399.594378] XFS (sda1): Unmount and run xfs_repair
>> [  399.599182] XFS (sda1): First 64 bytes of corrupted metadata buffer:
>> [  399.605575] e47db000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
>> XFSB.s..
>> [  399.613613] e47db010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> [  399.621637] e47db020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
>> .0@"Q.O..sV.q..$
>> [  399.629679] e47db030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  
>> 
>> [  399.637856] XFS (sda1): metadata I/O error: block 0x48b9ff80 
>> ("xfs_trans_read_buf_map") error 117 numblks 8
>> [  399.648165] XFS (sda1): Metadata corruption detected at 
>> xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
>> [  399.659378] XFS (sda1): Unmount and run xfs_repair
>> [  399.664196] XFS (sda1): First 64 bytes of corrupted metadata buffer:
>> [  399.670570] e47db000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
>> XFSB.s..
>> [  399.678610] e47db010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> [  399.686643] e47db020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
>> .0@"Q.O..sV.q..$
>> [  399.694681] e47db030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  
>> 
>> [  399.702794] XFS (sda1): metadata I/O error: block 0x48b9ff80 
>> ("xfs_trans_read_buf_map") error 117 numblks 8
> 
> I'd suggest the next step is to look for a matching XFS bug in your
> distro and, if possible, try a different distro and see if you get the
> same result.
> 
>> 
>> 
>> Thank you !
>> 
>> 
>>> 
 
 Thank you!
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>>> 
>>> 
>>> --
>>> Cheers,
>>> Brad
>> 
> 
> 
> 
> -- 
> Cheers,
> Brad



Re: [ceph-users] Civetweb log format

2018-03-08 Thread Matt Benjamin
Hi Yehuda,

I did add support for logging arbitrary headers, but not a
configurable log record à la web servers.  To level set, David, are you
speaking about a file or pipe log sink on the RGW host?
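For anyone who wants to experiment, I believe the header logging mentioned above is
exposed through the rgw_log_http_headers option together with the RGW ops log; a
rough ceph.conf sketch (option names from memory and the section name is just an
example, so please check the docs for your release):

[client.rgw.gateway1]
    rgw enable ops log = true
    rgw log http headers = "http_authorization, http_x_forwarded_for"

The heavier alternative David describes further down is simply raising the RGW
debug level, e.g. "debug rgw = 10/10", at the cost of much larger logs per request.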

Matt

On Thu, Mar 8, 2018 at 7:55 PM, Yehuda Sadeh-Weinraub  wrote:
> On Thu, Mar 8, 2018 at 2:22 PM, David Turner  wrote:
>> I remember some time ago Yehuda had commented on a thread like this saying
>> that it would make sense to add a logging/auditing feature like this to RGW.
>> I haven't heard much about it since then, though.  Yehuda, do you remember
>> that and/or think that logging like this might become viable.
>
> I vaguely remember Matt was working on this. Matt?
>
> Yehuda
>
>>
>>
>> On Thu, Mar 8, 2018 at 4:17 PM Aaron Bassett 
>> wrote:
>>>
>>> Yea thats what I was afraid of. I'm looking at possibly patching to add
>>> it, but i really dont want to support my own builds. I suppose other
>>> alternatives are to use proxies to log stuff, but that makes me sad.
>>>
>>> Aaron
>>>
>>>
>>> On Mar 8, 2018, at 12:36 PM, David Turner  wrote:
>>>
>>> Setting radosgw debug logging to 10/10 is the only way I've been able to
>>> get the access key in the logs for requests.  It's very unfortunate as it
>>> DRASTICALLY increases the amount of log per request, but it's what we needed
>>> to do to be able to have the access key in the logs along with the request.
>>>
>>> On Tue, Mar 6, 2018 at 3:09 PM Aaron Bassett 
>>> wrote:

 Hey all,
 I'm trying to get something of an audit log out of radosgw. To that end I
 was wondering if theres a mechanism to customize the log format of 
 civetweb.
 It's already writing IP, HTTP Verb, path, response and time, but I'm hoping
 to get it to print the Authorization header of the request, which 
 containers
 the access key id which we can tie back into the systems we use to issue
 credentials. Any thoughts?

 Thanks,
 Aaron
 CONFIDENTIALITY NOTICE
 This e-mail message and any attachments are only for the use of the
 intended recipient and may contain information that is privileged,
 confidential or exempt from disclosure under applicable law. If you are not
 the intended recipient, any disclosure, distribution or other use of this
 e-mail message or attachments is prohibited. If you have received this
 e-mail message in error, please delete and notify the sender immediately.
 Thank you.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


Re: [ceph-users] Civetweb log format

2018-03-08 Thread Yehuda Sadeh-Weinraub
On Thu, Mar 8, 2018 at 2:22 PM, David Turner  wrote:
> I remember some time ago Yehuda had commented on a thread like this saying
> that it would make sense to add a logging/auditing feature like this to RGW.
> I haven't heard much about it since then, though.  Yehuda, do you remember
> that and/or think that logging like this might become viable.

I vaguely remember Matt was working on this. Matt?

Yehuda

>
>
> On Thu, Mar 8, 2018 at 4:17 PM Aaron Bassett 
> wrote:
>>
>> Yea thats what I was afraid of. I'm looking at possibly patching to add
>> it, but i really dont want to support my own builds. I suppose other
>> alternatives are to use proxies to log stuff, but that makes me sad.
>>
>> Aaron
>>
>>
>> On Mar 8, 2018, at 12:36 PM, David Turner  wrote:
>>
>> Setting radosgw debug logging to 10/10 is the only way I've been able to
>> get the access key in the logs for requests.  It's very unfortunate as it
>> DRASTICALLY increases the amount of log per request, but it's what we needed
>> to do to be able to have the access key in the logs along with the request.
>>
>> On Tue, Mar 6, 2018 at 3:09 PM Aaron Bassett 
>> wrote:
>>>
>>> Hey all,
>>> I'm trying to get something of an audit log out of radosgw. To that end I
>>> was wondering if theres a mechanism to customize the log format of civetweb.
>>> It's already writing IP, HTTP Verb, path, response and time, but I'm hoping
>>> to get it to print the Authorization header of the request, which containers
>>> the access key id which we can tie back into the systems we use to issue
>>> credentials. Any thoughts?
>>>
>>> Thanks,
>>> Aaron
>>> CONFIDENTIALITY NOTICE
>>> This e-mail message and any attachments are only for the use of the
>>> intended recipient and may contain information that is privileged,
>>> confidential or exempt from disclosure under applicable law. If you are not
>>> the intended recipient, any disclosure, distribution or other use of this
>>> e-mail message or attachments is prohibited. If you have received this
>>> e-mail message in error, please delete and notify the sender immediately.
>>> Thank you.
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-08 Thread Brad Hubbard
On Fri, Mar 9, 2018 at 3:54 AM, Subhachandra Chandra
 wrote:
> I noticed a similar crash too. Unfortunately, I did not get much info in the
> logs.
>
>  *** Caught signal (Segmentation fault) **
>
> Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]:  in thread 7f63a0a97700
> thread_name:safe_timer
>
> Mar 07 17:58:28 data7 ceph-osd-run.sh[796380]: docker_exec.sh: line 56:
> 797138 Segmentation fault  (core dumped) "$@"

The log isn't very helpful AFAICT. Are these both container
environments? If so, what are the details (OS, etc.)?

Can anyone capture a core file? Please feel free to open a tracker on this.
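If it helps, a rough way to capture something usable (a sketch only; package and
unit names vary by distro, so treat these as assumptions):

ulimit -c unlimited                          # or raise LimitCORE for the ceph-osd unit
coredumpctl list ceph-osd                    # on systemd hosts with systemd-coredump
gdb /usr/bin/ceph-osd /path/to/corefile      # with the ceph debuginfo/dbg packages installed
(gdb) thread apply all bt

Attaching the resulting backtrace to the tracker ticket would be ideal.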

>
>
> Thanks
>
> Subhachandra
>
>
>
> On Thu, Mar 8, 2018 at 6:00 AM, Dietmar Rieder 
> wrote:
>>
>> Hi,
>>
>> I noticed in my client (using cephfs) logs that an osd was unexpectedly
>> going down.
>> While checking the osd logs for the affected OSD I found that the osd
>> was seg faulting:
>>
>> []
>> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
>> (Segmentation fault) **
>>  in thread 7fd9af370700 thread_name:safe_timer
>>
>>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>> luminous (stable)
>>1: (()+0xa3c611) [0x564585904611]
>> 2: (()+0xf5e0) [0x7fd9b66305e0]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>> [...]
>>
>> Should I open a ticket for this? What additional information is needed?
>>
>>
>> I put the relevant log entries for download under [1], so maybe someone
>> with more
>> experience can find some useful information therein.
>>
>> Thanks
>>   Dietmar
>>
>>
>> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>>
>> --
>> _
>> D i e t m a r  R i e d e r, Mag.Dr.
>> Innsbruck Medical University
>> Biocenter - Division for Bioinformatics
>> Email: dietmar.rie...@i-med.ac.at
>> Web:   http://www.icbi.at
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad


Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs

2018-03-08 Thread David Turner
PGs being unevenly distributed is a common occurrence in Ceph.  Luminous
started making some steps towards correcting this, but you're in Jewel.
There are a lot of threads in the ML archives about fixing PG
distribution.  Generally every method comes down to increasing the weight
on OSDs with too few PGs and decreasing the weight on the OSDs with too
many PGs.  There are a lot of schools of thought on the best way to
implement this in your environment which has everything to do with your
client IO patterns and workloads.  Looking into `ceph osd reweight-by-pg`
might be a good place for you to start as you are only looking at 1 pool in
your cluster.  If you have more pools, you generally need `ceph osd
reweight-by-utilization`.
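For reference, a rough idea of how that looks on the CLI (a sketch only; the 110
overload threshold is just an example, and the dry-run "test-" variants should be
present in recent Jewel builds as far as I know, so please verify first):

ceph osd test-reweight-by-pg 110             # dry run: show what would change (single-pool case)
ceph osd reweight-by-pg 110                  # apply it
ceph osd test-reweight-by-utilization 110    # dry run for the multi-pool case
ceph osd reweight-by-utilization 110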

On Wed, Mar 7, 2018 at 8:19 AM shadow_lin  wrote:

> Hi list,
>Ceph version is jewel 10.2.10 and all osd are using filestore.
> The Cluster has 96 osds and 1 pool with size=2 replication with 4096
> pg(base on pg calculate method from ceph doc for 100pg/per osd).
> The osd with the most pg count has 104 PGs and there are 6 osds have above
> 100 PGs
> Most of the osd have around 7x-9x PGs
> The osd with the least pg count has 58 PGs
>
> During the write test some of the osds have very high fs_apply_latency
> like 1000ms-4000ms while the normal ones are like 100-600ms. The osds with
> high latency are always the ones with more pg on it.
>
> iostat on the high latency osd shows the hdds are having high %util at
> about 95%-96% while the normal ones are having %util at 40%-60%
>
> I think the reason to cause this is because the osds have more pgs need to
> handle more write request to it.Is this right?
> But even though the pg distribution is not even but the variation is not
> that much.How could the performance be so sensitive to it?
>
> Is there anything I can do to improve the performance and reduce the
> latency?
>
> How can I make the pg distribution to be more even?
>
> Thanks
>
>
> 2018-03-07
> --
> shadowlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] set pg_num on pools with different size

2018-03-08 Thread Nagy Ákos
Hi,

we have a Ceph cluster with 3 nodes and 20 OSDs, with 6/7/7 2 TB HDDs per
node.

In the long term we want to use 7-9 pools, and for 20 OSDs and 8 pools I
calculated that the ideal pg_num would be 250 (20 * 100 / 8).

In this case each OSD would normally store 100 PGs, which is the recommended value.

I have a few problems:

1. I have 1736 PGs, and if I want to create a new pool with 270 PGs, I
get this error:

Error ERANGE:  pg_num 270 size 2 would mean 4012 total pgs, which
exceeds max 4000 (mon_max_pg_per_osd 200 * num_in_osds 20)


2. Now we have 8 pools, but only one of them stores a huge amount of data,
and for this reason I get a warning:

health: HEALTH_WARN
    1 pools have many more objects per pg than average

But in the past I remember getting a warning that the pg_num for a pool
is less/more than the average pg_num in the cluster.


In this case how can I set the optimal pg_num for my pools?

Some debug data:

OSD number: 20

  data:
    pools:   8 pools, 1736 pgs
    objects: 560k objects, 1141 GB
    usage:   2331 GB used, 30053 GB / 32384 GB avail
    pgs: 1736 active+clean
           
           
POOLS:
    NAME    ID USED   %USED MAX AVAIL OBJECTS
    kvmpool 5  34094M  0.24    13833G    8573
    rbd 6    155G  1.11    13833G   94056
    lxdhv04 15 29589M  0.21    13833G   12805
    lxdhv01 16 14480M  0.10    13833G    9732
    lxdhv02 17 14840M  0.10    13833G    7931
    lxdhv03 18 18735M  0.13    13833G    7567
    cephfs-metadata 22 40433k 0    13833G   11336
    cephfs-data 23   876G  5.96    13833G  422108

   
pool 5 'kvmpool' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 1909 lfor 0/1906 owner
18446744073709551615 flags hashpspool stripe_width 0 application rbd
pool 6 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 8422 lfor 0/2375 owner
18446744073709551615 flags hashpspool stripe_width 0 application rbd
pool 15 'lxdhv04' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 3053 flags hashpspool
stripe_width 0 application rbd
pool 16 'lxdhv01' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 3054 flags hashpspool
stripe_width 0 application rbd
pool 17 'lxdhv02' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 8409 flags hashpspool
stripe_width 0 application rbd
pool 18 'lxdhv03' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 3066 flags hashpspool
stripe_width 0 application rbd
pool 22 'cephfs-metadata' replicated size 2 min_size 1 crush_rule 0
object_hash rjenkins pg_num 100 pgp_num 100 last_change 8405 flags
hashpspool stripe_width 0 application cephfs
pool 23 'cephfs-data' replicated size 2 min_size 1 crush_rule 0
object_hash rjenkins pg_num 100 pgp_num 100 last_change 8405 flags
hashpspool stripe_width 0 application cephfs
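For what it's worth, the numbers in the ERANGE error above work out like this
(just the arithmetic, using the figures from this mail):

(1736 existing + 270 new) PGs * size 2   = 4012 PG instances
mon_max_pg_per_osd 200 * 20 in OSDs      = 4000 allowed
current load: 1736 * 2 / 20 OSDs         = ~174 PG instances per OSD (vs. the ~100 target)

So the cluster is already well above the 100-PGs-per-OSD target before the new pool
is created. mon_max_pg_per_osd can be raised on the mons (my assumption that it
applies here, please verify), but using smaller pg_num values for future pools is
the cleaner fix, since pg_num cannot be decreased on existing pools in this release.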


-- 
Ákos



Re: [ceph-users] Civetweb log format

2018-03-08 Thread David Turner
I remember some time ago Yehuda had commented on a thread like this saying
that it would make sense to add a logging/auditing feature like this to
RGW.  I haven't heard much about it since then, though.  Yehuda, do you
remember that and/or think that logging like this might become viable?

On Thu, Mar 8, 2018 at 4:17 PM Aaron Bassett 
wrote:

> Yea thats what I was afraid of. I'm looking at possibly patching to add
> it, but i really dont want to support my own builds. I suppose other
> alternatives are to use proxies to log stuff, but that makes me sad.
>
> Aaron
>
>
> On Mar 8, 2018, at 12:36 PM, David Turner  wrote:
>
> Setting radosgw debug logging to 10/10 is the only way I've been able to
> get the access key in the logs for requests.  It's very unfortunate as it
> DRASTICALLY increases the amount of log per request, but it's what we
> needed to do to be able to have the access key in the logs along with the
> request.
>
> On Tue, Mar 6, 2018 at 3:09 PM Aaron Bassett 
> wrote:
>
>> Hey all,
>> I'm trying to get something of an audit log out of radosgw. To that end I
>> was wondering if theres a mechanism to customize the log format of
>> civetweb. It's already writing IP, HTTP Verb, path, response and time, but
>> I'm hoping to get it to print the Authorization header of the request,
>> which containers the access key id which we can tie back into the systems
>> we use to issue credentials. Any thoughts?
>>
>> Thanks,
>> Aaron
>> CONFIDENTIALITY NOTICE
>> This e-mail message and any attachments are only for the use of the
>> intended recipient and may contain information that is privileged,
>> confidential or exempt from disclosure under applicable law. If you are not
>> the intended recipient, any disclosure, distribution or other use of this
>> e-mail message or attachments is prohibited. If you have received this
>> e-mail message in error, please delete and notify the sender immediately.
>> Thank you.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>>
>
>


Re: [ceph-users] Civetweb log format

2018-03-08 Thread Aaron Bassett
Yeah, that's what I was afraid of. I'm looking at possibly patching to add it, but 
I really don't want to support my own builds. I suppose other alternatives are 
to use proxies to log stuff, but that makes me sad.

Aaron

On Mar 8, 2018, at 12:36 PM, David Turner 
> wrote:

Setting radosgw debug logging to 10/10 is the only way I've been able to get 
the access key in the logs for requests.  It's very unfortunate as it 
DRASTICALLY increases the amount of log per request, but it's what we needed to 
do to be able to have the access key in the logs along with the request.

On Tue, Mar 6, 2018 at 3:09 PM Aaron Bassett 
> wrote:
Hey all,
I'm trying to get something of an audit log out of radosgw. To that end I was 
wondering if there's a mechanism to customize the log format of civetweb. It's 
already writing IP, HTTP verb, path, response and time, but I'm hoping to get 
it to print the Authorization header of the request, which contains the 
access key id which we can tie back into the systems we use to issue 
credentials. Any thoughts?

Thanks,
Aaron
CONFIDENTIALITY NOTICE
This e-mail message and any attachments are only for the use of the intended 
recipient and may contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient, any disclosure, distribution or other use of this e-mail message or 
attachments is prohibited. If you have received this e-mail message in error, 
please delete and notify the sender immediately. Thank you.



Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Lazuardi Nasution
Hi Jason,

I understand. Thank you for your explanation.

Best regards,

On Mar 9, 2018 3:45 AM, "Jason Dillaman"  wrote:

> On Thu, Mar 8, 2018 at 3:41 PM, Lazuardi Nasution
>  wrote:
> > Hi Jason,
> >
> > If there is the case that the gateway cannot access the Ceph, I think you
> > are right. Anyway, I put iSCSI Gateway on MON node.
>
> It's connectivity to the specific OSD associated to the IO operation
> that is the issue. If you understand the risks and are comfortable
> with them, active/active is a perfectly acceptable solution. I just
> wanted to ensure you understood the risk since you stated corruption
> "seems impossible".
>
> > Best regards,
> >
> >
> > On Mar 9, 2018 1:41 AM, "Jason Dillaman"  wrote:
> >
> > On Thu, Mar 8, 2018 at 12:47 PM, Lazuardi Nasution
> >  wrote:
> >> Jason,
> >>
> >> As long you don't activate any cache and single image for single client
> >> only, it seem impossible to have old data overwrite. May be, it is
> related
> >> to I/O pattern too. Anyway, maybe other Ceph users have different
> >> experience. It can be different result with different case.
> >
> > Write operation (A) is sent to gateway X who cannot access the Ceph
> > cluster so the IO is queued. The initiator's multipath layer times out
> > and resents write operation (A) to gateway Y, followed by write
> > operation (A') to gateway Y. Shortly thereafter, gateway X is able to
> > send its delayed write operation (A) to the Ceph cluster and
> > overwrites write operation (A') -- thus your data went back in time.
> >
> >> Best regards,
> >>
> >>
> >> On Mar 9, 2018 12:35 AM, "Jason Dillaman"  wrote:
> >>
> >> On Thu, Mar 8, 2018 at 11:59 AM, Lazuardi Nasution
> >>  wrote:
> >>> Hi Mike,
> >>>
> >>> Since I have moved from LIO to TGT, I can do full ALUA (active/active)
> of
> >>> multiple gateways. Of course I have to disable any write back cache at
> >>> any
> >>> level (RBD cache and TGT cache). It seem to be safe to disable
> exclusive
> >>> lock since each RBD image is accessed only by single client and as long
> >>> as
> >>> I
> >>> know mostly ALUA use RR of I/O path.
> >>
> >> How do you figure that's safe for preventing an overwrite with old
> >> data in an active/active path hiccup?
> >>
> >>> Best regards,
> >>>
> >>> On Mar 8, 2018 11:54 PM, "Mike Christie"  wrote:
> 
>  On 03/07/2018 09:24 AM, shadow_lin wrote:
>  > Hi Christie,
>  > Is it safe to use active/passive multipath with krbd with exclusive
>  > lock
>  > for lio/tgt/scst/tcmu?
> 
>  No. We tried to use lio and krbd initially, but there is a issue where
>  IO might get stuck in the target/block layer and get executed after
> new
>  IO. So for lio, tgt and tcmu it is not safe as is right now. We could
>  add some code tcmu's file_example handler which can be used with krbd
> so
>  it works like the rbd one.
> 
>  I do know enough about SCST right now.
> 
> 
>  > Is it safe to use active/active multipath If use suse kernel with
>  > target_core_rbd?
>  > Thanks.
>  >
>  > 2018-03-07
>  >
>  >
>  > 
> 
>  > shadowlin
>  >
>  >
>  >
>  > 
> 
>  >
>  > *From:* Mike Christie 
>  > *Sent:* 2018-03-07 03:51
>  > *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
>  > Exclusive Lock
>  > *To:* "Lazuardi Nasution","Ceph
>  > Users"
>  > *Cc:*
>  >
>  > On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>  > > Hi,
>  > >
>  > > I want to do load balanced multipathing (multiple iSCSI
>  > gateway/exporter
>  > > nodes) of iSCSI backed with RBD images. Should I disable
>  > exclusive
>  > lock
>  > > feature? What if I don't disable that feature? I'm using TGT
>  > (manual
>  > > way) since I get so many CPU stuck error messages when I was
>  > using
>  > LIO.
>  > >
>  >
>  > You are using LIO/TGT with krbd right?
>  >
>  > You cannot or shouldn't do active/active multipathing. If you
> have
>  > the
>  > lock enabled then it bounces between paths for each IO and will
> be
>  > slow.
>  > If you do not have it enabled then you can end up with stale IO
>  > overwriting current data.
>  >
>  >
>  >
>  >
> 
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>
> >>
> >>
> >> --
> >> Jason
> >>
> >>
> >>
> 

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Jason Dillaman
On Thu, Mar 8, 2018 at 3:41 PM, Lazuardi Nasution
 wrote:
> Hi Jason,
>
> If there is the case that the gateway cannot access the Ceph, I think you
> are right. Anyway, I put iSCSI Gateway on MON node.

It's connectivity to the specific OSD associated to the IO operation
that is the issue. If you understand the risks and are comfortable
with them, active/active is a perfectly acceptable solution. I just
wanted to ensure you understood the risk since you stated corruption
"seems impossible".

> Best regards,
>
>
> On Mar 9, 2018 1:41 AM, "Jason Dillaman"  wrote:
>
> On Thu, Mar 8, 2018 at 12:47 PM, Lazuardi Nasution
>  wrote:
>> Jason,
>>
>> As long you don't activate any cache and single image for single client
>> only, it seem impossible to have old data overwrite. May be, it is related
>> to I/O pattern too. Anyway, maybe other Ceph users have different
>> experience. It can be different result with different case.
>
> Write operation (A) is sent to gateway X who cannot access the Ceph
> cluster so the IO is queued. The initiator's multipath layer times out
> and resents write operation (A) to gateway Y, followed by write
> operation (A') to gateway Y. Shortly thereafter, gateway X is able to
> send its delayed write operation (A) to the Ceph cluster and
> overwrites write operation (A') -- thus your data went back in time.
>
>> Best regards,
>>
>>
>> On Mar 9, 2018 12:35 AM, "Jason Dillaman"  wrote:
>>
>> On Thu, Mar 8, 2018 at 11:59 AM, Lazuardi Nasution
>>  wrote:
>>> Hi Mike,
>>>
>>> Since I have moved from LIO to TGT, I can do full ALUA (active/active) of
>>> multiple gateways. Of course I have to disable any write back cache at
>>> any
>>> level (RBD cache and TGT cache). It seem to be safe to disable exclusive
>>> lock since each RBD image is accessed only by single client and as long
>>> as
>>> I
>>> know mostly ALUA use RR of I/O path.
>>
>> How do you figure that's safe for preventing an overwrite with old
>> data in an active/active path hiccup?
>>
>>> Best regards,
>>>
>>> On Mar 8, 2018 11:54 PM, "Mike Christie"  wrote:

 On 03/07/2018 09:24 AM, shadow_lin wrote:
 > Hi Christie,
 > Is it safe to use active/passive multipath with krbd with exclusive
 > lock
 > for lio/tgt/scst/tcmu?

 No. We tried to use lio and krbd initially, but there is a issue where
 IO might get stuck in the target/block layer and get executed after new
 IO. So for lio, tgt and tcmu it is not safe as is right now. We could
 add some code tcmu's file_example handler which can be used with krbd so
 it works like the rbd one.

 I do know enough about SCST right now.


 > Is it safe to use active/active multipath If use suse kernel with
 > target_core_rbd?
 > Thanks.
 >
 > 2018-03-07
 >
 >
 > 
 > shadowlin
 >
 >
 >
 > 
 >
 > *From:* Mike Christie 
 > *Sent:* 2018-03-07 03:51
 > *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
 > Exclusive Lock
 > *To:* "Lazuardi Nasution","Ceph
 > Users"
 > *Cc:*
 >
 > On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
 > > Hi,
 > >
 > > I want to do load balanced multipathing (multiple iSCSI
 > gateway/exporter
 > > nodes) of iSCSI backed with RBD images. Should I disable
 > exclusive
 > lock
 > > feature? What if I don't disable that feature? I'm using TGT
 > (manual
 > > way) since I get so many CPU stuck error messages when I was
 > using
 > LIO.
 > >
 >
 > You are using LIO/TGT with krbd right?
 >
 > You cannot or shouldn't do active/active multipathing. If you have
 > the
 > lock enabled then it bounces between paths for each IO and will be
 > slow.
 > If you do not have it enabled then you can end up with stale IO
 > overwriting current data.
 >
 >
 >
 >

>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>> --
>> Jason
>>
>>
>>
>
>
>
> --
> Jason
>
>



-- 
Jason


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Lazuardi Nasution
Hi Jason,

If there is a case where the gateway cannot access the Ceph cluster, I think you
are right. Anyway, I put the iSCSI gateway on a MON node.

Best regards,


On Mar 9, 2018 1:41 AM, "Jason Dillaman"  wrote:

On Thu, Mar 8, 2018 at 12:47 PM, Lazuardi Nasution
 wrote:
> Jason,
>
> As long you don't activate any cache and single image for single client
> only, it seem impossible to have old data overwrite. May be, it is related
> to I/O pattern too. Anyway, maybe other Ceph users have different
> experience. It can be different result with different case.

Write operation (A) is sent to gateway X who cannot access the Ceph
cluster so the IO is queued. The initiator's multipath layer times out
and resents write operation (A) to gateway Y, followed by write
operation (A') to gateway Y. Shortly thereafter, gateway X is able to
send its delayed write operation (A) to the Ceph cluster and
overwrites write operation (A') -- thus your data went back in time.

> Best regards,
>
>
> On Mar 9, 2018 12:35 AM, "Jason Dillaman"  wrote:
>
> On Thu, Mar 8, 2018 at 11:59 AM, Lazuardi Nasution
>  wrote:
>> Hi Mike,
>>
>> Since I have moved from LIO to TGT, I can do full ALUA (active/active) of
>> multiple gateways. Of course I have to disable any write back cache at
any
>> level (RBD cache and TGT cache). It seem to be safe to disable exclusive
>> lock since each RBD image is accessed only by single client and as long
as
>> I
>> know mostly ALUA use RR of I/O path.
>
> How do you figure that's safe for preventing an overwrite with old
> data in an active/active path hiccup?
>
>> Best regards,
>>
>> On Mar 8, 2018 11:54 PM, "Mike Christie"  wrote:
>>>
>>> On 03/07/2018 09:24 AM, shadow_lin wrote:
>>> > Hi Christie,
>>> > Is it safe to use active/passive multipath with krbd with exclusive
>>> > lock
>>> > for lio/tgt/scst/tcmu?
>>>
>>> No. We tried to use lio and krbd initially, but there is a issue where
>>> IO might get stuck in the target/block layer and get executed after new
>>> IO. So for lio, tgt and tcmu it is not safe as is right now. We could
>>> add some code tcmu's file_example handler which can be used with krbd so
>>> it works like the rbd one.
>>>
>>> I do know enough about SCST right now.
>>>
>>>
>>> > Is it safe to use active/active multipath If use suse kernel with
>>> > target_core_rbd?
>>> > Thanks.
>>> >
>>> > 2018-03-07
>>> >
>>> > 

>>> > shadowlin
>>> >
>>> >
>>> > 

>>> >
>>> > *From:* Mike Christie 
>>> > *Sent:* 2018-03-07 03:51
>>> > *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
>>> > Exclusive Lock
>>> > *To:* "Lazuardi Nasution","Ceph
>>> > Users"
>>> > *Cc:*
>>> >
>>> > On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>>> > > Hi,
>>> > >
>>> > > I want to do load balanced multipathing (multiple iSCSI
>>> > gateway/exporter
>>> > > nodes) of iSCSI backed with RBD images. Should I disable
>>> > exclusive
>>> > lock
>>> > > feature? What if I don't disable that feature? I'm using TGT
>>> > (manual
>>> > > way) since I get so many CPU stuck error messages when I was
>>> > using
>>> > LIO.
>>> > >
>>> >
>>> > You are using LIO/TGT with krbd right?
>>> >
>>> > You cannot or shouldn't do active/active multipathing. If you have
>>> > the
>>> > lock enabled then it bounces between paths for each IO and will be
>>> > slow.
>>> > If you do not have it enabled then you can end up with stale IO
>>> > overwriting current data.
>>> >
>>> >
>>> >
>>> >
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Jason
>
>
>



--
Jason


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Jason Dillaman
On Thu, Mar 8, 2018 at 2:11 PM, Ashish Samant  wrote:
>
>
> On 03/08/2018 10:44 AM, Mike Christie wrote:
>>
>> On 03/08/2018 10:59 AM, Lazuardi Nasution wrote:
>>>
>>> Hi Mike,
>>>
>>> Since I have moved from LIO to TGT, I can do full ALUA (active/active)
>>> of multiple gateways. Of course I have to disable any write back cache
>>> at any level (RBD cache and TGT cache). It seem to be safe to disable
>>> exclusive lock since each RBD image is accessed only by single client
>>> and as long as I know mostly ALUA use RR of I/O path.
>>
>> It might be possible if you have configured your timers correctly but I
>> do not think anyone has figured it all out yet.
>>
>> Here is a simple but long example of the problem. Sorry for the length,
>> but I want to make sure people know the risks.
>>
>> You have 2 iscsi target nodes and 1 iscsi initiator connected to both
>> doing active/active over them.
>>
>> To make it really easy to hit, the iscsi initiator should be connected
>> to the target with a different nic port or network than what is being
>> used for ceph traffic.
>>
>> 1. Prep the data. Just clear the first sector of your iscsi disk. On the
>> initiator system do:
>>
>> dd if=/dev/zero of=/dev/sdb count=1 ofile=direct
>>
>> 2. Kill the network/port for one of the iscsi targets ceph traffic. So
>> for example on target node 1 pull its cable for ceph traffic if you set
>> it up where iscsi and ceph use different physical ports. iSCSI traffic
>> should be unaffected for this test.
>>
>> 3. Write some new data over the sector we just wrote in #1. This will
>> get sent from the initiator to the target ok, but get stuck in the
>> rbd/ceph layer since that network is down:
>>
>> dd if=somefile of=/dev/sdb count=1 ofile=direct ifile=direct
>>
>> 4. The initiator's eh timers will fire and that will fail and will the
>> command will get failed and retired on the other path. After that dd in
>> #3 completes run:
>>
>> dd if=someotherfile of=/dev/sdb count=1 ofile=direct ifile=direct
>>
>> This should execute quickly since it goes through the good iscsi and
>> ceph path right away.
>>
>> 5. Now plug the cable back in and wait for maybe 30 seconds for the
>> network to come back up and the stuck command to run.
>>
>> 6. Now do
>>
>> dd if=/dev/sdb of=somenewfile count=1 ifile=direct ofile=direct
>>
>> The data is going to be the data sent in step 3 and not the new data in
>> step 4.
>>
>> To get around this issue you could try to set the krbd
>> osd_request_timeout to a value shorter than the initiator side failover
>> time out (for multipath-tools/open-iscsi in linux this would be
>> fast_io_fail_tmo/replacement timeout) + the various TMF/EH but also
>> account for the transport related timers that might short circut/bypass
>> the TMF based EH.
>>
>> One problem with trying to rely on configuring that is handling all the
>> corner cases. So you have:
>>
>> - Transport (nop) timer or SCSI/TMF command timer set so the
>> fast_io_fail/replacement timer starts at N seconds and then fires at M.
>> - It is a really bad connection so it takes N - 1 seconds to get the
>> SCSI command from the initiator to target.
>> - At the N second mark the iscsi connection is dropped the
>> fast_io_fail/replacement timer is started.
>>
>> For the easy case, the SCSI command is sent directly to krbd and so if
>> osd_request_timeout is less than M seconds then the command will be
>> failed in time and we would not hit the problem above.
>>
>> If something happens in the target stack like the SCSI command gets
>> stuck/queued then your osd_request_timeout value might be too short. For
>> example, if you were using tgt/lio right now and this was a
>> COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
>> seconds, and then the write part might take osd_request_timeout -1
>> seconds so you need to have your fast_io_fail long enough for that type
>> of case. For tgt a WRITE_SAME command might be N WRITEs to krbd, so you
>> need to make sure your queue depths are set so you do not end up with
>> something similar as the CAW but where M WRITEs get executed and take
>> osd_request_timeout -1 seconds then M more, etc and at some point the
>> iscsi connection is lost so the failover timer had started. Some ceph
>> requests also might be multiple requests.
>>
>> Maybe an overly paranoid case, but I still worry about because I do not
>> want to mess up anyone's data, is that a disk on the iscsi target node
>> goes flakey. In the target we do kmalloc(GFP_KERNEL) to execute a SCSI
>> command, and that blocks trying to write data to the flakey disk. If the
>> disk recovers and we can eventually recover, did you account for the
>> recovery timers in that code path when configuring the failover and krbd
>> timers.
>>
>> One other case we have been debating about is if krbd/librbd is able to
>> put the ceph request on the wire but then the iscsi connection goes
>> down, will the ceph request always get sent to the OSD before the
>> initiator side failover timeouts have fired and it starts using a
>> different target node.

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Ashish Samant



On 03/08/2018 10:44 AM, Mike Christie wrote:

On 03/08/2018 10:59 AM, Lazuardi Nasution wrote:

Hi Mike,

Since I have moved from LIO to TGT, I can do full ALUA (active/active)
of multiple gateways. Of course I have to disable any write back cache
at any level (RBD cache and TGT cache). It seem to be safe to disable
exclusive lock since each RBD image is accessed only by single client
and as long as I know mostly ALUA use RR of I/O path.

It might be possible if you have configured your timers correctly but I
do not think anyone has figured it all out yet.

Here is a simple but long example of the problem. Sorry for the length,
but I want to make sure people know the risks.

You have 2 iscsi target nodes and 1 iscsi initiator connected to both
doing active/active over them.

To make it really easy to hit, the iscsi initiator should be connected
to the target with a different nic port or network than what is being
used for ceph traffic.

1. Prep the data. Just clear the first sector of your iscsi disk. On the
initiator system do:

dd if=/dev/zero of=/dev/sdb count=1 ofile=direct

2. Kill the network/port for one of the iscsi targets ceph traffic. So
for example on target node 1 pull its cable for ceph traffic if you set
it up where iscsi and ceph use different physical ports. iSCSI traffic
should be unaffected for this test.

3. Write some new data over the sector we just wrote in #1. This will
get sent from the initiator to the target ok, but get stuck in the
rbd/ceph layer since that network is down:

dd if=somefile of=/dev/sdb count=1 ofile=direct ifile=direct

4. The initiator's eh timers will fire and that will fail and will the
command will get failed and retired on the other path. After that dd in
#3 completes run:

dd if=someotherfile of=/dev/sdb count=1 ofile=direct ifile=direct

This should execute quickly since it goes through the good iscsi and
ceph path right away.

5. Now plug the cable back in and wait for maybe 30 seconds for the
network to come back up and the stuck command to run.

6. Now do

dd if=/dev/sdb of=somenewfile count=1 ifile=direct ofile=direct

The data is going to be the data sent in step 3 and not the new data in
step 4.

To get around this issue you could try to set the krbd
osd_request_timeout to a value shorter than the initiator side failover
time out (for multipath-tools/open-iscsi in linux this would be
fast_io_fail_tmo/replacement timeout) + the various TMF/EH but also
account for the transport related timers that might short circut/bypass
the TMF based EH.

One problem with trying to rely on configuring that is handling all the
corner cases. So you have:

- Transport (nop) timer or SCSI/TMF command timer set so the
fast_io_fail/replacement timer starts at N seconds and then fires at M.
- It is a really bad connection so it takes N - 1 seconds to get the
SCSI command from the initiator to target.
- At the N second mark the iscsi connection is dropped the
fast_io_fail/replacement timer is started.

For the easy case, the SCSI command is sent directly to krbd and so if
osd_request_timeout is less than M seconds then the command will be
failed in time and we would not hit the problem above.

If something happens in the target stack like the SCSI command gets
stuck/queued then your osd_request_timeout value might be too short. For
example, if you were using tgt/lio right now and this was a
COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
seconds, and then the write part might take osd_request_timeout -1
seconds so you need to have your fast_io_fail long enough for that type
of case. For tgt a WRITE_SAME command might be N WRITEs to krbd, so you
need to make sure your queue depths are set so you do not end up with
something similar as the CAW but where M WRITEs get executed and take
osd_request_timeout -1 seconds then M more, etc and at some point the
iscsi connection is lost so the failover timer had started. Some ceph
requests also might be multiple requests.

Maybe an overly paranoid case, but I still worry about because I do not
want to mess up anyone's data, is that a disk on the iscsi target node
goes flakey. In the target we do kmalloc(GFP_KERNEL) to execute a SCSI
command, and that blocks trying to write data to the flakey disk. If the
disk recovers and we can eventually recover, did you account for the
recovery timers in that code path when configuring the failover and krbd
timers.

One other case we have been debating about is if krbd/librbd is able to
put the ceph request on the wire but then the iscsi connection goes
down, will the ceph request always get sent to the OSD before the
initiator side failover timeouts have fired and it starts using a
different target node.
If krbd/librbd is able to put the ceph request on the wire, then that
could cause data corruption in the active/passive case too, right?

Thanks,
Ashish





Best regards,

On Mar 8, 2018 11:54 PM, "Mike Christie" > wrote:

 

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Mike Christie
On 03/08/2018 12:44 PM, Mike Christie wrote:
> stuck/queued then your osd_request_timeout value might be too short. For

Sorry, I meant too long.


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Mike Christie
On 03/08/2018 10:59 AM, Lazuardi Nasution wrote:
> Hi Mike,
> 
> Since I have moved from LIO to TGT, I can do full ALUA (active/active)
> of multiple gateways. Of course I have to disable any write back cache
> at any level (RBD cache and TGT cache). It seem to be safe to disable
> exclusive lock since each RBD image is accessed only by single client
> and as long as I know mostly ALUA use RR of I/O path.

It might be possible if you have configured your timers correctly but I
do not think anyone has figured it all out yet.

Here is a simple but long example of the problem. Sorry for the length,
but I want to make sure people know the risks.

You have 2 iscsi target nodes and 1 iscsi initiator connected to both
doing active/active over them.

To make it really easy to hit, the iscsi initiator should be connected
to the target with a different nic port or network than what is being
used for ceph traffic.

1. Prep the data. Just clear the first sector of your iscsi disk. On the
initiator system do:

dd if=/dev/zero of=/dev/sdb count=1 oflag=direct

2. Kill the network/port for one of the iscsi targets ceph traffic. So
for example on target node 1 pull its cable for ceph traffic if you set
it up where iscsi and ceph use different physical ports. iSCSI traffic
should be unaffected for this test.

3. Write some new data over the sector we just wrote in #1. This will
get sent from the initiator to the target ok, but get stuck in the
rbd/ceph layer since that network is down:

dd if=somefile of=/dev/sdb count=1 oflag=direct iflag=direct

4. The initiator's EH timers will fire and the command will get failed and
retried on the other path. After the dd in #3 completes, run:

dd if=someotherfile of=/dev/sdb count=1 oflag=direct iflag=direct

This should execute quickly since it goes through the good iscsi and
ceph path right away.

5. Now plug the cable back in and wait for maybe 30 seconds for the
network to come back up and the stuck command to run.

6. Now do

dd if=/dev/sdb of=somenewfile count=1 iflag=direct oflag=direct

The data is going to be the data sent in step 3 and not the new data in
step 4.

To get around this issue you could try to set the krbd
osd_request_timeout to a value shorter than the initiator side failover
timeout (for multipath-tools/open-iscsi in Linux this would be
fast_io_fail_tmo/replacement timeout) + the various TMF/EH timers, but also
account for the transport related timers that might short circuit/bypass
the TMF based EH.
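To make that concrete, the knobs involved would be set roughly like this (a sketch
only; the values are illustrative, not recommendations, and the krbd map option
needs a kernel new enough to support osd_request_timeout):

rbd map rbd/image1 -o osd_request_timeout=25     # fail stuck OSD requests after 25s

# /etc/iscsi/iscsid.conf on the initiator:
node.session.timeo.replacement_timeout = 30

# /etc/multipath.conf, per the fast_io_fail_tmo discussion above:
defaults {
    fast_io_fail_tmo 30
}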

One problem with trying to rely on configuring that is handling all the
corner cases. So you have:

- Transport (nop) timer or SCSI/TMF command timer set so the
fast_io_fail/replacement timer starts at N seconds and then fires at M.
- It is a really bad connection so it takes N - 1 seconds to get the
SCSI command from the initiator to target.
- At the N second mark the iscsi connection is dropped the
fast_io_fail/replacement timer is started.

For the easy case, the SCSI command is sent directly to krbd and so if
osd_request_timeout is less than M seconds then the command will be
failed in time and we would not hit the problem above.

If something happens in the target stack like the SCSI command gets
stuck/queued then your osd_request_timeout value might be too short. For
example, if you were using tgt/lio right now and this was a
COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
seconds, and then the write part might take osd_request_timeout -1
seconds so you need to have your fast_io_fail long enough for that type
of case. For tgt a WRITE_SAME command might be N WRITEs to krbd, so you
need to make sure your queue depths are set so you do not end up with
something similar as the CAW but where M WRITEs get executed and take
osd_request_timeout -1 seconds then M more, etc and at some point the
iscsi connection is lost so the failover timer had started. Some ceph
requests also might be multiple requests.

Maybe an overly paranoid case, but one I still worry about because I do not
want to mess up anyone's data, is that a disk on the iscsi target node
goes flaky. In the target we do kmalloc(GFP_KERNEL) to execute a SCSI
command, and that blocks trying to write data to the flaky disk. If the
disk recovers and we can eventually recover, did you account for the
recovery timers in that code path when configuring the failover and krbd
timers?

One other case we have been debating about is if krbd/librbd is able to
put the ceph request on the wire but then the iscsi connection goes
down, will the ceph request always get sent to the OSD before the
initiator side failover timeouts have fired and it starts using a
different target node.



> Best regards,
> 
> On Mar 8, 2018 11:54 PM, "Mike Christie"  > wrote:
> 
> On 03/07/2018 09:24 AM, shadow_lin wrote:
> > Hi Christie,
> > Is it safe to use active/passive multipath with krbd with
> exclusive lock
> > for 

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Jason Dillaman
On Thu, Mar 8, 2018 at 12:47 PM, Lazuardi Nasution
 wrote:
> Jason,
>
> As long you don't activate any cache and single image for single client
> only, it seem impossible to have old data overwrite. May be, it is related
> to I/O pattern too. Anyway, maybe other Ceph users have different
> experience. It can be different result with different case.

Write operation (A) is sent to gateway X, which cannot access the Ceph
cluster, so the IO is queued. The initiator's multipath layer times out
and resends write operation (A) to gateway Y, followed by write
operation (A') to gateway Y. Shortly thereafter, gateway X is able to
send its delayed write operation (A) to the Ceph cluster and
overwrites write operation (A') -- thus your data went back in time.

> Best regards,
>
>
> On Mar 9, 2018 12:35 AM, "Jason Dillaman"  wrote:
>
> On Thu, Mar 8, 2018 at 11:59 AM, Lazuardi Nasution
>  wrote:
>> Hi Mike,
>>
>> Since I have moved from LIO to TGT, I can do full ALUA (active/active) of
>> multiple gateways. Of course I have to disable any write back cache at any
>> level (RBD cache and TGT cache). It seem to be safe to disable exclusive
>> lock since each RBD image is accessed only by single client and as long as
>> I
>> know mostly ALUA use RR of I/O path.
>
> How do you figure that's safe for preventing an overwrite with old
> data in an active/active path hiccup?
>
>> Best regards,
>>
>> On Mar 8, 2018 11:54 PM, "Mike Christie"  wrote:
>>>
>>> On 03/07/2018 09:24 AM, shadow_lin wrote:
>>> > Hi Christie,
>>> > Is it safe to use active/passive multipath with krbd with exclusive
>>> > lock
>>> > for lio/tgt/scst/tcmu?
>>>
>>> No. We tried to use lio and krbd initially, but there is a issue where
>>> IO might get stuck in the target/block layer and get executed after new
>>> IO. So for lio, tgt and tcmu it is not safe as is right now. We could
>>> add some code tcmu's file_example handler which can be used with krbd so
>>> it works like the rbd one.
>>>
>>> I do know enough about SCST right now.
>>>
>>>
>>> > Is it safe to use active/active multipath If use suse kernel with
>>> > target_core_rbd?
>>> > Thanks.
>>> >
>>> > 2018-03-07
>>> >
>>> > 
>>> > shadowlin
>>> >
>>> >
>>> > 
>>> >
>>> > *From:* Mike Christie 
>>> > *Sent:* 2018-03-07 03:51
>>> > *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
>>> > Exclusive Lock
>>> > *To:* "Lazuardi Nasution","Ceph
>>> > Users"
>>> > *Cc:*
>>> >
>>> > On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>>> > > Hi,
>>> > >
>>> > > I want to do load balanced multipathing (multiple iSCSI
>>> > gateway/exporter
>>> > > nodes) of iSCSI backed with RBD images. Should I disable
>>> > exclusive
>>> > lock
>>> > > feature? What if I don't disable that feature? I'm using TGT
>>> > (manual
>>> > > way) since I get so many CPU stuck error messages when I was
>>> > using
>>> > LIO.
>>> > >
>>> >
>>> > You are using LIO/TGT with krbd right?
>>> >
>>> > You cannot or shouldn't do active/active multipathing. If you have
>>> > the
>>> > lock enabled then it bounces between paths for each IO and will be
>>> > slow.
>>> > If you do not have it enabled then you can end up with stale IO
>>> > overwriting current data.
>>> >
>>> >
>>> >
>>> >
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Jason
>
>
>



-- 
Jason


Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-08 Thread Subhachandra Chandra
I noticed a similar crash too. Unfortunately, I did not get much info in
the logs.

 *** Caught signal (Segmentation fault) **

Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]:  in thread 7f63a0a97700
thread_name:safe_timer

Mar 07 17:58:28 data7 ceph-osd-run.sh[796380]: docker_exec.sh: line 56:
797138 Segmentation fault  (core dumped) "$@"


Thanks

Subhachandra


On Thu, Mar 8, 2018 at 6:00 AM, Dietmar Rieder 
wrote:

> Hi,
>
> I noticed in my client (using cephfs) logs that an osd was unexpectedly
> going down.
> While checking the osd logs for the affected OSD I found that the osd
> was seg faulting:
>
> []
> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7fd9af370700 thread_name:safe_timer
>
>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
> luminous (stable)
>1: (()+0xa3c611) [0x564585904611]
> 2: (()+0xf5e0) [0x7fd9b66305e0]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> [...]
>
> Should I open a ticket for this? What additional information is needed?
>
>
> I put the relevant log entries for download under [1], so maybe someone
> with more
> experience can find some useful information therein.
>
> Thanks
>   Dietmar
>
>
> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Lazuardi Nasution
Jason,

As long as you don't activate any cache and use a single image for a single
client only, it seems impossible to have old data overwritten. Maybe it is
related to the I/O pattern too. Anyway, maybe other Ceph users have different
experience. The result can be different in different cases.

Best regards,


On Mar 9, 2018 12:35 AM, "Jason Dillaman"  wrote:

On Thu, Mar 8, 2018 at 11:59 AM, Lazuardi Nasution
 wrote:
> Hi Mike,
>
> Since I moved from LIO to TGT, I can do full ALUA (active/active) across
> multiple gateways. Of course I have to disable any write-back cache at any
> level (RBD cache and TGT cache). It seems safe to disable the exclusive
> lock since each RBD image is accessed by only a single client and, as far
> as I know, ALUA mostly uses round-robin across I/O paths.

How do you figure that's safe for preventing an overwrite with old
data in an active/active path hiccup?

> Best regards,
>
> On Mar 8, 2018 11:54 PM, "Mike Christie"  wrote:
>>
>> On 03/07/2018 09:24 AM, shadow_lin wrote:
>> > Hi Christie,
>> > Is it safe to use active/passive multipath with krbd with exclusive
lock
>> > for lio/tgt/scst/tcmu?
>>
>> No. We tried to use lio and krbd initially, but there is an issue where
>> IO might get stuck in the target/block layer and get executed after new
>> IO. So for lio, tgt and tcmu it is not safe as is right now. We could
>> add some code to tcmu's file_example handler which can be used with krbd
>> so it works like the rbd one.
>>
>> I do not know enough about SCST right now.
>>
>>
>> > Is it safe to use active/active multipath If use suse kernel with
>> > target_core_rbd?
>> > Thanks.
>> >
>> > 2018-03-07
>> > 

>> > shadowlin
>> >
>> > 

>> >
>> > *From:* Mike Christie 
>> > *Sent:* 2018-03-07 03:51
>> > *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
>> > Exclusive Lock
>> > *To:* "Lazuardi Nasution","Ceph
>> > Users"
>> > *Cc:*
>> >
>> > On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>> > > Hi,
>> > >
>> > > I want to do load balanced multipathing (multiple iSCSI
>> > gateway/exporter
>> > > nodes) of iSCSI backed with RBD images. Should I disable
exclusive
>> > lock
>> > > feature? What if I don't disable that feature? I'm using TGT
>> > (manual
>> > > way) since I get so many CPU stuck error messages when I was
using
>> > LIO.
>> > >
>> >
>> > You are using LIO/TGT with krbd right?
>> >
>> > You cannot or shouldn't do active/active multipathing. If you have
>> > the
>> > lock enabled then it bounces between paths for each IO and will be
>> > slow.
>> > If you do not have it enabled then you can end up with stale IO
>> > overwriting current data.
>> >
>> >
>> >
>> >
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Civetweb log format

2018-03-08 Thread David Turner
Setting radosgw debug logging to 10/10 is the only way I've been able to
get the access key in the logs for requests.  It's very unfortunate as it
DRASTICALLY increases the amount of logging per request, but it's what we
needed to do to be able to have the access key in the logs along with the
request.
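
For reference, that is roughly the following (the rgw instance name is only a
placeholder here, so adjust it to your own section and admin socket):

[client.rgw.gateway1]
debug rgw = 10/10

# or at runtime via the admin socket (socket path may differ on your install)
ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config set debug_rgw 10/10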

On Tue, Mar 6, 2018 at 3:09 PM Aaron Bassett 
wrote:

> Hey all,
> I'm trying to get something of an audit log out of radosgw. To that end I
> was wondering if theres a mechanism to customize the log format of
> civetweb. It's already writing IP, HTTP Verb, path, response and time, but
> I'm hoping to get it to print the Authorization header of the request,
> which containers the access key id which we can tie back into the systems
> we use to issue credentials. Any thoughts?
>
> Thanks,
> Aaron
> CONFIDENTIALITY NOTICE
> This e-mail message and any attachments are only for the use of the
> intended recipient and may contain information that is privileged,
> confidential or exempt from disclosure under applicable law. If you are not
> the intended recipient, any disclosure, distribution or other use of this
> e-mail message or attachments is prohibited. If you have received this
> e-mail message in error, please delete and notify the sender immediately.
> Thank you.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Jason Dillaman
On Thu, Mar 8, 2018 at 11:59 AM, Lazuardi Nasution
 wrote:
> Hi Mike,
>
> Since I moved from LIO to TGT, I can do full ALUA (active/active) across
> multiple gateways. Of course I have to disable any write-back cache at any
> level (RBD cache and TGT cache). It seems safe to disable the exclusive
> lock since each RBD image is accessed by only a single client and, as far
> as I know, ALUA mostly uses round-robin across I/O paths.

How do you figure that's safe for preventing an overwrite with old
data in an active/active path hiccup?

> Best regards,
>
> On Mar 8, 2018 11:54 PM, "Mike Christie"  wrote:
>>
>> On 03/07/2018 09:24 AM, shadow_lin wrote:
>> > Hi Christie,
>> > Is it safe to use active/passive multipath with krbd with exclusive lock
>> > for lio/tgt/scst/tcmu?
>>
>> No. We tried to use lio and krbd initially, but there is an issue where
>> IO might get stuck in the target/block layer and get executed after new
>> IO. So for lio, tgt and tcmu it is not safe as is right now. We could
>> add some code to tcmu's file_example handler which can be used with krbd
>> so it works like the rbd one.
>>
>> I do not know enough about SCST right now.
>>
>>
>> > Is it safe to use active/active multipath If use suse kernel with
>> > target_core_rbd?
>> > Thanks.
>> >
>> > 2018-03-07
>> > 
>> > shadowlin
>> >
>> > 
>> >
>> > *From:* Mike Christie 
>> > *Sent:* 2018-03-07 03:51
>> > *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
>> > Exclusive Lock
>> > *To:* "Lazuardi Nasution","Ceph
>> > Users"
>> > *Cc:*
>> >
>> > On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>> > > Hi,
>> > >
>> > > I want to do load balanced multipathing (multiple iSCSI
>> > gateway/exporter
>> > > nodes) of iSCSI backed with RBD images. Should I disable exclusive
>> > lock
>> > > feature? What if I don't disable that feature? I'm using TGT
>> > (manual
>> > > way) since I get so many CPU stuck error messages when I was using
>> > LIO.
>> > >
>> >
>> > You are using LIO/TGT with krbd right?
>> >
>> > You cannot or shouldn't do active/active multipathing. If you have
>> > the
>> > lock enabled then it bounces between paths for each IO and will be
>> > slow.
>> > If you do not have it enabled then you can end up with stale IO
>> > overwriting current data.
>> >
>> >
>> >
>> >
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Lazuardi Nasution
Hi Mike,

Since I moved from LIO to TGT, I can do full ALUA (active/active) across
multiple gateways. Of course I have to disable any write-back cache at any
level (RBD cache and TGT cache). It seems safe to disable the exclusive
lock since each RBD image is accessed by only a single client and, as far
as I know, ALUA mostly uses round-robin across I/O paths.
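
For context, a gateway definition can be as simple as the following
hypothetical /etc/tgt/targets.conf stanza (placeholder names; this particular
sketch assumes tgt was built with its rbd backing store rather than exporting
a krbd-mapped device):

<target iqn.2018-03.com.example:rbd-image1>
    driver iscsi
    bs-type rbd
    backing-store rbd/image1
</target>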

Best regards,

On Mar 8, 2018 11:54 PM, "Mike Christie"  wrote:

> On 03/07/2018 09:24 AM, shadow_lin wrote:
> > Hi Christie,
> > Is it safe to use active/passive multipath with krbd with exclusive lock
> > for lio/tgt/scst/tcmu?
>
> No. We tried to use lio and krbd initially, but there is an issue where
> IO might get stuck in the target/block layer and get executed after new
> IO. So for lio, tgt and tcmu it is not safe as is right now. We could
> add some code to tcmu's file_example handler which can be used with krbd
> so it works like the rbd one.
>
> I do not know enough about SCST right now.
>
>
> > Is it safe to use active/active multipath If use suse kernel with
> > target_core_rbd?
> > Thanks.
> >
> > 2018-03-07
> > 
> > shadowlin
> >
> > 
> >
> > *From:* Mike Christie 
> > *Sent:* 2018-03-07 03:51
> > *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
> > Exclusive Lock
> > *To:* "Lazuardi Nasution","Ceph
> > Users"
> > *Cc:*
> >
> > On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
> > > Hi,
> > >
> > > I want to do load balanced multipathing (multiple iSCSI
> gateway/exporter
> > > nodes) of iSCSI backed with RBD images. Should I disable exclusive
> lock
> > > feature? What if I don't disable that feature? I'm using TGT
> (manual
> > > way) since I get so many CPU stuck error messages when I was using
> LIO.
> > >
> >
> > You are using LIO/TGT with krbd right?
> >
> > You cannot or shouldn't do active/active multipathing. If you have
> the
> > lock enabled then it bounces between paths for each IO and will be
> slow.
> > If you do not have it enabled then you can end up with stale IO
> > overwriting current data.
> >
> >
> >
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-08 Thread Mike Christie
On 03/07/2018 09:24 AM, shadow_lin wrote:
> Hi Christie,
> Is it safe to use active/passive multipath with krbd with exclusive lock
> for lio/tgt/scst/tcmu?

No. We tried to use lio and krbd initially, but there is an issue where
IO might get stuck in the target/block layer and get executed after new
IO. So for lio, tgt and tcmu it is not safe as is right now. We could
add some code to tcmu's file_example handler which can be used with krbd
so it works like the rbd one.

I do not know enough about SCST right now.


> Is it safe to use active/active multipath If use suse kernel with
> target_core_rbd?
> Thanks.
>  
> 2018-03-07
> 
> shadowlin
>  
> 
> 
> *From:* Mike Christie 
> *Sent:* 2018-03-07 03:51
> *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
> Exclusive Lock
> *To:* "Lazuardi Nasution","Ceph
> Users"
> *Cc:*
>  
> On 03/06/2018 01:17 PM, Lazuardi Nasution wrote: 
> > Hi, 
> >  
> > I want to do load balanced multipathing (multiple iSCSI 
> gateway/exporter 
> > nodes) of iSCSI backed with RBD images. Should I disable exclusive lock 
> > feature? What if I don't disable that feature? I'm using TGT (manual 
> > way) since I get so many CPU stuck error messages when I was using LIO. 
> >  
>  
> You are using LIO/TGT with krbd right? 
>  
> You cannot or shouldn't do active/active multipathing. If you have the 
> lock enabled then it bounces between paths for each IO and will be slow. 
> If you do not have it enabled then you can end up with stale IO 
> overwriting current data. 
>  
>  
>  
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore bluestore_prefer_deferred_size and WAL size

2018-03-08 Thread Budai Laszlo

Dear all,

I'm reading about the bluestore_prefer_deferred_size parameter for Bluestore. 
Are there any hints about its size when using a dedicated SSD for block.wal and
block.db?
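
For context, this is where the knob lives (a ceph.conf sketch; the option names
are from the Luminous BlueStore config, and the values below are only
placeholders, not recommendations):

[osd]
# writes at or below this size are deferred through the WAL first
bluestore prefer deferred size hdd = 32768
bluestore prefer deferred size ssd = 0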

Thank you in advance!

Laszlo  
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change radosgw object owner

2018-03-08 Thread Ryan Leimenstoll
Hi Robin, 

Thanks for the pointer! My one concern is that it didn’t seem to update the
original object owner’s quota, which is a bit of a sticking point. Is this
expected (and is there a workaround)? I will admit to being a bit naive about
how radosgw’s quota system works under the hood.
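
(For anyone following along, the per-object chown Robin describes below boils
down to something like this hypothetical, untested sketch; I am assuming here
that "bi put" accepts --infile like the other radosgw-admin put subcommands:)

# dump the bucket index entry for one object
radosgw-admin --bucket=$MYBUCKET bi get --object=$OBJECT > entry.json
# edit "owner" and "owner_display_name" in entry.json to the new user, then:
radosgw-admin --bucket=$MYBUCKET bi put --object=$OBJECT --infile=entry.json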

Thanks,
Ryan

> On Mar 6, 2018, at 2:54 PM, Robin H. Johnson  wrote:
> 
> On Tue, Mar 06, 2018 at 02:40:11PM -0500, Ryan Leimenstoll wrote:
>> Hi all, 
>> 
>> We are trying to move a bucket in radosgw from one user to another in an
>> effort to both change ownership and attribute the storage usage of the data
>> to the receiving user’s quota.
>> 
>> I have unlinked the bucket and linked it to the new user using: 
>> 
>> radosgw-admin bucket unlink --bucket=$MYBUCKET --uid=$USER
>> radosgw-admin bucket link --bucket=$MYBUCKET --bucket-id=$BUCKET_ID 
>> --uid=$NEWUSER
>> 
>> However, perhaps as expected, the owner of all the objects in the
>> bucket remain as $USER. I don’t believe changing the owner is a
>> supported operation from the S3 protocol, however it would be very
>> helpful to have the ability to do this on the radosgw backend. This is
>> especially useful for large buckets/datasets where copying the objects
>> out and into radosgw could be time consuming.
> At the raw radosgw-admin level, you should be able to do it with
> bi-list/bi-get/bi-put. The downside here is that I don't think the BI ops are
> exposed in the HTTP Admin API, so it's going to be really expensive to chown
> lots of objects.
> 
> Using a quick example:
> # radosgw-admin \
>  --uid UID-CENSORED \
>  --bucket BUCKET-CENSORED \
>  bi get \
>  --object=OBJECTNAME-CENSORED
> {
>     "type": "plain",
>     "idx": "OBJECTNAME-CENSORED",
>     "entry": {
>         "name": "OBJECTNAME-CENSORED",
>         "instance": "",
>         "ver": {
>             "pool": 5,
>             "epoch": 266028
>         },
>         "locator": "",
>         "exists": "true",
>         "meta": {
>             "category": 1,
>             "size": 1066,
>             "mtime": "2016-11-17 17:01:29.668746Z",
>             "etag": "e7a75c39df3d123c716d5351059ad2d9",
>             "owner": "UID-CENSORED",
>             "owner_display_name": "UID-CENSORED",
>             "content_type": "image/png",
>             "accounted_size": 1066,
>             "user_data": ""
>         },
>         "tag": "default.293024600.1188196",
>         "flags": 0,
>         "pending_map": [],
>         "versioned_epoch": 0
>     }
> }
> 
> -- 
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-08 Thread Dietmar Rieder
Hi,

I noticed in my client (using cephfs) logs that an osd was unexpectedly
going down.
While checking the osd logs for the affected OSD I found that the osd
was seg faulting:

[]
2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7fd9af370700 thread_name:safe_timer

  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
luminous (stable)
   1: (()+0xa3c611) [0x564585904611]
2: (()+0xf5e0) [0x7fd9b66305e0]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
[...]

Should I open a ticket for this? What additional information is needed?


I put the relevant log entries for download under [1], so maybe someone
with more
experience can find some useful information therein.

Thanks
  Dietmar


[1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at





signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object Gateway - Server Side Encryption

2018-03-08 Thread Amardeep Singh

Hi,

I am trying to configure server side encryption using Key Management 
Service as per documentation 
http://docs.ceph.com/docs/master/radosgw/encryption/


Configured Keystone/Barbican integration and it's working; tested using 
curl commands. After I configure RadosGW and use boto.s3.connection from 
python or the s3cmd client, an error is thrown.

boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
<?xml version="1.0" encoding="UTF-8"?> ... AccessDenied ... Failed to
retrieve the actual key, kms-keyid: 616b2ce2-053a-41e3-b51e-0ff53e33cf81
... newbucket ... tx77750-005aa1274b-ac51-uk-west ... ac51-uk-west-uk

In the server-side logs it is getting the token and Barbican is authenticating 
the request and providing the secret URL, but it is unable to serve the key.

 22:10:03.940091 7f056f7eb700 15 ceph_armor ret=16
 22:10:03.940111 7f056f7eb700 15 
supplied_md5=eb1a3227cdc3fedbaec2fe38bf6c044a
 22:10:03.940129 7f056f7eb700 20 reading from 
uk-west.rgw.meta:root:.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1
 22:10:03.940138 7f056f7eb700 20 get_system_obj_state: 
rctx=0x7f056f7e39f0 
obj=uk-west.rgw.meta:root:.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 
state=0x56540487a5a0 s->prefetch_data=0
 22:10:03.940145 7f056f7eb700 10 cache get: 
name=uk-west.rgw.meta+root+.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 
: hit (requested=0x16, cached=0x17)
 22:10:03.940152 7f056f7eb700 20 get_system_obj_state: s->obj_tag was 
set empty
 22:10:03.940155 7f056f7eb700 10 cache get: 
name=uk-west.rgw.meta+root+.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 
: hit (requested=0x11, cached=0x17)
 22:10:03.944015 7f056f7eb700 20 bucket quota: max_objects=1638400 
max_size=-1
 22:10:03.944030 7f056f7eb700 20 bucket quota OK: stats.num_objects=7 
stats.size=50
 22:10:03.944176 7f056f7eb700 20 Getting KMS encryption key for 
key=616b2ce2-053a-41e3-b51e-0ff53e33cf81
 22:10:03.944225 7f056f7eb700 20 Requesting secret from barbican 
url=http://keyserver.rados:5000/v3/auth/tokens
 22:10:03.944281 7f056f7eb700 20 sending request to 
http://keyserver.rados:5000/v3/auth/tokens
 22:10:04.405974 7f056f7eb700 20 sending request to 
http://keyserver.rados:9311/v1/secrets/616b2ce2-053a-41e3-b51e-0ff53e33cf81
 22:10:05.519874 7f056f7eb700 5 Failed to retrieve secret from 
barbican:616b2ce2-053a-41e3-b51e-0ff53e33cf81
 22:10:05.519901 7f056f7eb700 5 ERROR: failed to retrieve actual key 
from key_id: 616b2ce2-053a-41e3-b51e-0ff53e33cf81
 22:10:05.519980 7f056f7eb700 2 req 387:1.581432:s3:PUT 
/encrypted.txt:put_obj:completing
 22:10:05.520187 7f056f7eb700 2 req 387:1.581640:s3:PUT 
/encrypted.txt:put_obj:op status=-13
 22:10:05.520193 7f056f7eb700 2 req 387:1.581645:s3:PUT 
/encrypted.txt:put_obj:http status=403
 22:10:05.520206 7f056f7eb700 1 == req done req=0x7f056f7e5190 op 
status=-13 http_status=403 ==

 22:10:05.520225 7f056f7eb700 20 process_request() returned -13
 22:10:05.520280 7f056f7eb700 1 civetweb: 0x5654042a1000: 
192.168.100.200 - - [02/Mar/2018:22:10:03 +0530] "PUT /encrypted.txt 
HTTP/1.1" 1 0 - Boto/2.38.0 Python/2.7.12 Linux/4.12.1-041201-generic

 22:10:06.116527 7f056e7e9700 20 HTTP_ACCEPT=*/*

The error is thrown from this line:
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_crypt.cc#L1063


I am unable to understand why it's throwing the error.

In ceph.conf following settings are done.

[global]
rgw barbican url = http://keyserver.rados:9311
rgw keystone barbican user = rgwcrypt
rgw keystone barbican password = rgwpass
rgw keystone barbican project = service
rgw keystone barbican domain = default
rgw keystone url = http://keyserver.rados:5000
rgw keystone api version = 3
rgw crypt require ssl = false
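
Is something like this the right way to verify that the rgw Barbican user can
actually read the secret (a hedged sketch; it assumes $TOKEN holds a Keystone
token for the rgwcrypt user scoped to the service project)?

curl -i -H "X-Auth-Token: $TOKEN" \
  http://keyserver.rados:9311/v1/secrets/616b2ce2-053a-41e3-b51e-0ff53e33cf81/payload

# my understanding is that a 403 here would point at the secret's ACL not
# granting read access to the rgwcrypt user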

Can someone help in figuring out what is missing?

Thanks,
Amar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg inconsistent

2018-03-08 Thread Harald Staub

Hi Brad

Thank you very much for your attention.

On 07.03.2018 23:46, Brad Hubbard wrote:

On Thu, Mar 8, 2018 at 1:22 AM, Harald Staub  wrote:

"ceph pg repair" leads to:
5.7bd repair 2 errors, 0 fixed

Only an empty list from:
rados list-inconsistent-obj 5.7bd --format=json-pretty

Inspired by http://tracker.ceph.com/issues/12577 , I tried again with more
verbose logging and searched the osd logs, e.g. for "!=" or "mismatch", but could
not find anything interesting. Oh well, these are several million lines
...

Any hint what I could look for?


Try searching for "scrub_compare_maps" and looking for "5.7bd" in that context.
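
(That search boils down to something like the following; the log path assumes
the default location, and osd.340 is the primary for this PG:)

grep -n "scrub_compare_maps" /var/log/ceph/ceph-osd.340.log | grep "5.7bd"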


These lines (from the primary OSD) may be interesting:

2018-03-07 14:20:31.405120 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] be_select_auth_object: error(s) osd 442 
for obj 5:bde7a84d:::rbd_data.d393823accce24.00010214:336d7, 
object_info_inconsistency
2018-03-07 14:20:31.405134 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] be_select_auth_object: selecting osd 340 
for obj 5:bde7a84d:::rbd_data.d393823accce24.00010214:336d7 with 
oi 
5:bde7a84d:::rbd_data.d393823accce24.00010214:336d7(505072'35716889 
osd.340.0:258067 dirty|data_digest|omap_digest s 4194304 uv 35452964 dd 
68383c60 od  alloc_hint [0 0 0])
2018-03-07 14:20:31.405172 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] be_select_auth_object: error(s) osd 442 
for obj 5:bde7a84d:::rbd_data.d393823accce24.00010214:head, 
snapset_inconsistency object_info_inconsistency
2018-03-07 14:20:31.405404 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] scrub_snapshot_metadata (repair) finish
2018-03-07 14:20:31.405413 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] scrub_compare_maps: discarding scrub results


Then I had another idea. The inconsistency errors were triggered by a 
scrub, not a deep scrub. So I triggered another scrub:


ceph pg scrub 5.7bd

And the problem got fixed.

Cheers
 Harry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] /var/lib/ceph/osd/ceph-xxx/current/meta shows "Structure needs cleaning"

2018-03-08 Thread Brad Hubbard
On Thu, Mar 8, 2018 at 7:33 PM, 赵赵贺东  wrote:
> Hi Brad,
>
> Thank you for your attention.
>
>> On Mar 8, 2018, at 4:47 PM, Brad Hubbard wrote:
>>
>> On Thu, Mar 8, 2018 at 5:01 PM, 赵贺东  wrote:
>>> Hi All,
>>>
>>> Every time after we activate osd, we got “Structure needs cleaning” in 
>>> /var/lib/ceph/osd/ceph-xxx/current/meta.
>>>
>>>
>>> /var/lib/ceph/osd/ceph-xxx/current/meta
>>> # ls -l
>>> ls: reading directory .: Structure needs cleaning
>>> total 0
>>>
>>> Could Anyone say something about this error?
>>
>> It's an indication of possible corruption on the filesystem containing 
>> "meta".
>>
>> Can you unmount it and run a filesystem check on it?
> I did some xfs_repair operations, but no effect. “Structure needs cleaning”
> still exists.
>
>
>
>>
>> At the time the filesystem first detected the corruption it would have
>> logged it to dmesg and possibly syslog which may give you a clue. Did
>> you lose power or have a kernel panic or something?
> We did not lose power.
> You are right, we get a metadata corruption in dmesg every time, right after
> the OSD activation operation.
>
> [  399.513525] XFS (sda1): Metadata corruption detected at 
> xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
> [  399.524709] XFS (sda1): Unmount and run xfs_repair
> [  399.529511] XFS (sda1): First 64 bytes of corrupted metadata buffer:
> [  399.535917] dd8f2000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
> XFSB.s..
> [  399.543959] dd8f2010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> [  399.551983] dd8f2020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
> .0@"Q.O..sV.q..$
> [  399.560037] dd8f2030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  
> 
> [  399.568118] XFS (sda1): metadata I/O error: block 0x48b9ff80 
> ("xfs_trans_read_buf_map") error 117 numblks 8
> [  399.583179] XFS (sda1): Metadata corruption detected at 
> xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
> [  399.594378] XFS (sda1): Unmount and run xfs_repair
> [  399.599182] XFS (sda1): First 64 bytes of corrupted metadata buffer:
> [  399.605575] e47db000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
> XFSB.s..
> [  399.613613] e47db010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> [  399.621637] e47db020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
> .0@"Q.O..sV.q..$
> [  399.629679] e47db030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  
> 
> [  399.637856] XFS (sda1): metadata I/O error: block 0x48b9ff80 
> ("xfs_trans_read_buf_map") error 117 numblks 8
> [  399.648165] XFS (sda1): Metadata corruption detected at 
> xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
> [  399.659378] XFS (sda1): Unmount and run xfs_repair
> [  399.664196] XFS (sda1): First 64 bytes of corrupted metadata buffer:
> [  399.670570] e47db000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
> XFSB.s..
> [  399.678610] e47db010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> [  399.686643] e47db020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
> .0@"Q.O..sV.q..$
> [  399.694681] e47db030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  
> 
> [  399.702794] XFS (sda1): metadata I/O error: block 0x48b9ff80 
> ("xfs_trans_read_buf_map") error 117 numblks 8

I'd suggest the next step is to look for a matching XFS bug in your
distro and, if possible, try a different distro and see if you get the
same result.

>
>
> Thank you !
>
>
>>
>>>
>>> Thank you!
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Cheers,
>> Brad
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] /var/lib/ceph/osd/ceph-xxx/current/meta shows "Structure needs cleaning"

2018-03-08 Thread 赵贺东
Hi Wido,

Thank you for attention!
> On Mar 8, 2018, at 4:21 PM, Wido den Hollander wrote:
> 
> 
> 
> On 03/08/2018 08:01 AM, 赵贺东 wrote:
>> Hi All,
>> Every time after we activate osd, we got “Structure needs cleaning” in 
>> /var/lib/ceph/osd/ceph-xxx/current/meta.
>> /var/lib/ceph/osd/ceph-xxx/current/meta
>> # ls -l
>> ls: reading directory .: Structure needs cleaning
>> total 0
>> Could Anyone say something about this error?
> 
> Seems like XFS is broken. I recommend that you wipe that OSD and reformat it 
> with ceph-disk/ceph-volume.
Because our Ceph runs on Ubuntu 14.04 and ceph-volume needs systemd support
(systemd is only available on Ubuntu 16.04), it makes things more complicated
if we want to use ceph-volume.

> Also check the SMART values and verify that the disk isn't broken.
Because every disk has the same problem, and it is only triggered by the OSD
activation operation.
If I deploy an OSD manually, I can see that “Structure needs cleaning” comes
out just after I try to start the ceph-osd daemon.
That means after I run “start ceph-osd id=” and then “ls -l
/var/lib/ceph/osd/ceph-xxx/current/meta”, there will be “Structure needs
cleaning”.
 
> 
> Do not attempt an XFS repair or something.
Yes xfs_repair has no effect.

Thank you!
> 
> Wido
> 
>> Thank you!
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] /var/lib/ceph/osd/ceph-xxx/current/meta shows "Structure needs cleaning"

2018-03-08 Thread 赵赵贺东
Hi Brad,

Thank you for your attention.

> On Mar 8, 2018, at 4:47 PM, Brad Hubbard wrote:
> 
> On Thu, Mar 8, 2018 at 5:01 PM, 赵贺东  wrote:
>> Hi All,
>> 
>> Every time after we activate osd, we got “Structure needs cleaning” in 
>> /var/lib/ceph/osd/ceph-xxx/current/meta.
>> 
>> 
>> /var/lib/ceph/osd/ceph-xxx/current/meta
>> # ls -l
>> ls: reading directory .: Structure needs cleaning
>> total 0
>> 
>> Could Anyone say something about this error?
> 
> It's an indication of possible corruption on the filesystem containing "meta".
> 
> Can you unmount it and run a filesystem check on it?
I did some xfs_repair operations, but no effect. “Structure needs cleaning”
still exists.



> 
> At the time the filesystem first detected the corruption it would have
> logged it to dmesg and possibly syslog which may give you a clue. Did
> you lose power or have a kernel panic or something?
We did not lose power.
You are right, we get a metadata corruption in dmesg every time, right after
the OSD activation operation.

[  399.513525] XFS (sda1): Metadata corruption detected at 
xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
[  399.524709] XFS (sda1): Unmount and run xfs_repair
[  399.529511] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[  399.535917] dd8f2000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
XFSB.s..
[  399.543959] dd8f2010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

[  399.551983] dd8f2020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
.0@"Q.O..sV.q..$
[  399.560037] dd8f2030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  

[  399.568118] XFS (sda1): metadata I/O error: block 0x48b9ff80 
("xfs_trans_read_buf_map") error 117 numblks 8
[  399.583179] XFS (sda1): Metadata corruption detected at 
xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
[  399.594378] XFS (sda1): Unmount and run xfs_repair
[  399.599182] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[  399.605575] e47db000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
XFSB.s..
[  399.613613] e47db010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

[  399.621637] e47db020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
.0@"Q.O..sV.q..$
[  399.629679] e47db030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  

[  399.637856] XFS (sda1): metadata I/O error: block 0x48b9ff80 
("xfs_trans_read_buf_map") error 117 numblks 8
[  399.648165] XFS (sda1): Metadata corruption detected at 
xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
[  399.659378] XFS (sda1): Unmount and run xfs_repair
[  399.664196] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[  399.670570] e47db000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  
XFSB.s..
[  399.678610] e47db010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

[  399.686643] e47db020: e5 30 40 22 51 8f 4f 1c 80 73 56 9b 71 aa 92 24  
.0@"Q.O..sV.q..$
[  399.694681] e47db030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  

[  399.702794] XFS (sda1): metadata I/O error: block 0x48b9ff80 
("xfs_trans_read_buf_map") error 117 numblks 8


Thank you !


> 
>> 
>> Thank you!
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Cheers,
> Brad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 19th April 2018: Ceph/Apache CloudStack day in London

2018-03-08 Thread Wido den Hollander

Hello Ceph (and CloudStack ;-) ) people!

Together with the Apache CloudStack [0] project we are organizing a Ceph 
Day in London on April 19th this year.


As there are many users running Apache CloudStack with Ceph as the storage 
behind their Virtual Machines, or using Ceph as an object store in 
addition to their VM offering, we thought it would be great to bring both 
communities together.


More information can be found on ceph.com [1] and you can already 
register [2] as well.


There will be a mix of talks, Ceph specific, CloudStack specific and 
talks touching both projects.


The morning will be a single track, while the afternoon splits into two 
tracks, Ceph and CloudStack.


Capacity is limited and we have registrations coming in with a steady 
flow, so please register soon if you want to attend!


Looking forward to seeing a big group of Ceph and CloudStack users and fans 
showing up in London!


Wido

[0]: https://cloudstack.apache.org/
[1]: https://ceph.com/cephdays/london/
[2]: 
https://www.eventbrite.co.uk/e/cloudstack-european-user-group-ceph-day-tickets-42670526694

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] improve single job sequencial read performance.

2018-03-08 Thread Cassiano Pilipavicius
Hi Alex... thank you for the tips! Yesterday I did a lot of testing 
and it seems that my network is really what is holding the speed down. I 
would just like to confirm that this is not really a problem or 
misconfiguration in my cluster that would be masked by the network 
upgrade. The cache makes things really better: the second time I read 
a file, even if I drop the caches in the guest OS, the data is read at 
the network speed limit.


I don't know if it is normal, but in the tests with fio, KRBD shows a 
great performance boost over librbd (50MB/s in KRBD, 28MB/s in librbd).
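
(For reference, the comparison was along these lines; pool/image names are
placeholders rather than my exact job files:)

# librbd, via fio's rbd engine
fio --name=seqread --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
    --rw=read --bs=4M --iodepth=1 --numjobs=1

# krbd, against the mapped block device
rbd map rbd/testimg
fio --name=seqread --ioengine=libaio --direct=1 --filename=/dev/rbd0 \
    --rw=read --bs=4M --iodepth=1 --numjobs=1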


To check how much the network latency is slowing things down, I 
created an SSD-only pool (4 SSDs) with size 2 and set the OSDs on one host to 
primary-affinity 0. When I ran the test with this config, data was read 
at 900MB/s and clat was under 1ms. When I turned primary-affinity back to 1 
and ran the same test, the bandwidth dropped to only 100MB/s, the 
highest clat was 250ms and the average 70ms.
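
(The affinity change itself is just the usual command, repeated for each OSD
on that host; osd.12 is only an example id:)

# keep this OSD from being selected as primary, so reads go to its peer
ceph osd primary-affinity osd.12 0
# and back to normal afterwards
ceph osd primary-affinity osd.12 1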


I will post the difference in speeds next week when I have the 
network upgraded, in case anyone would like to see the results.



Em 3/7/2018 10:38 PM, Alex Gorbachev escreveu:

On Wed, Mar 7, 2018 at 8:37 PM, Alex Gorbachev  wrote:

On Wed, Mar 7, 2018 at 9:43 AM, Cassiano Pilipavicius
 wrote:

Hi all, this issue has already been discussed in older threads and I've
already tried most of the solutions proposed there.


I have a small and old ceph cluster (started in hammer and upgraded until
luminous 12.2.2), connected through a single shared 1GbE link (I know this is
not optimal, but for my workload it is handling the load reasonably well). I
use RBD for small VMs in libvirt/qemu.

My problem is... If I need to copy a large file (cp, dd, tar), the read
speed is very low (15MB/s). I've tested the write speed of a single job with
dd zero (direct) > file and the speed is good enough for my environment
(80MB/s).

If I run parallel jobs, I can saturate the network connection; the speed
scales with the number of jobs. I've tried setting read-ahead in ceph.conf
and in the guest OS.

I've never heard any report of a cluster using a single 1GbE link; maybe this
speed is what I should expect? Next week I will be upgrading the network to 2
x 10GbE (private and public), but I would like to know if I have any issue
that I need to address before, as the problem could be masked by the network
upgrade.

If anyone can throw some light, point me in any direction, or tell me
this is what I should expect, I would really appreciate it. If anyone needs
more info please let me know.

Workarounds I have heard of or used:

1. Use fancy striping and parallelize that way (see the sketch after this list)
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017744.html

2. Use lvm and set up a striped volume over multiple RBDs

3. Weird but we had seen improvement in sequential speeds with larger
object size (16 MB) in the past

4. Caching solutions may help smooth out peaks and valleys of IO -
bcache, flashcache and we have successfully used EnhanceIO with
writethrough mode

5. Better SSD journals help if using filestore

6. Caching controllers, e.g. Areca
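
A sketch of workaround 1, with purely illustrative numbers (a 64K stripe unit
spread across 16 objects, instead of the default of filling one 4M object at a
time):

rbd create rbd/striped-img --size 102400 --object-size 4M \
    --stripe-unit 65536 --stripe-count 16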

--
Alex Gorbachev
Storcium



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] /var/lib/ceph/osd/ceph-xxx/current/meta shows "Structure needs cleaning"

2018-03-08 Thread Wido den Hollander



On 03/08/2018 08:01 AM, 赵贺东 wrote:

Hi All,

Every time after we activate osd, we got “Structure needs cleaning” in 
/var/lib/ceph/osd/ceph-xxx/current/meta.


/var/lib/ceph/osd/ceph-xxx/current/meta
# ls -l
ls: reading directory .: Structure needs cleaning
total 0

Could Anyone say something about this error?



Seems like XFS is broken. I recommend that you wipe that OSD and 
reformat it with ceph-disk/ceph-volume.


Also check the SMART values and verify that the disk isn't broken.

Do not attempt an XFS repair or something.

Wido


Thank you!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] /var/lib/ceph/osd/ceph-xxx/current/meta shows "Structure needs cleaning"

2018-03-08 Thread Brad Hubbard
On Thu, Mar 8, 2018 at 5:01 PM, 赵贺东  wrote:
> Hi All,
>
> Every time after we activate osd, we got “Structure needs cleaning” in 
> /var/lib/ceph/osd/ceph-xxx/current/meta.
>
>
> /var/lib/ceph/osd/ceph-xxx/current/meta
> # ls -l
> ls: reading directory .: Structure needs cleaning
> total 0
>
> Could Anyone say something about this error?

It's an indication of possible corruption on the filesystem containing "meta".

Can you unmount it and run a filesystem check on it?
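
Something along these lines (device, mount point and OSD id below are only
placeholders; -n makes xfs_repair report problems without modifying anything):

# stop the OSD first, then:
umount /var/lib/ceph/osd/ceph-12
xfs_repair -n /dev/sda1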

At the time the filesystem first detected the corruption it would have
logged it to dmesg and possibly syslog which may give you a clue. Did
you lose power or have a kernel panic or something?

>
> Thank you!
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore questions

2018-03-08 Thread Caspar Smit
Hi Frank,


2018-03-04 1:40 GMT+01:00 Frank Ritchie :

> Hi all,
>
> I have a few questions on using BlueStore.
>
> With FileStore it is not uncommon to see 1 nvme device being used as the
> journal device for up to 12 OSDs.
>
>
Can an adequately sized nvme device also be used as the wal/db device for
> up to 12 OSDs?
>
>
Well, you could ask yourself the question: What's the impact of losing
those 12 OSDs at once when the NVMe device fails? If the OSDs are slow
spinners you probably won't have to worry about performance using the NVMe
device as wal/db.

Are there any rules of thumb for sizing wal/db?
>
>
As David Turner mentioned in a previous post, you would be ok with using
around 10GB per TB of OSD, so a 40GB wal/db partition for a 4TB OSD.
This is no hard 'rule' to follow but will likely be enough to avoid
spillover of the DB to the spinner disks. The actual size of the DB
depends on a few factors, such as the number of objects.
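
For example, something like this hedged sketch (placeholder devices: a 4TB
spinner plus a ~40GB DB partition carved out of the shared NVMe):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1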

Hope this helps,
Caspar


> Would love to hear some actual numbers from users.
>
> thx
> Frank
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com