Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host

2018-12-21 Thread Michael Green
I was informed today that the CEPH environment I’ve been working on is no 
longer available. Unfortunately this happened before I could try any of your 
suggestions, Roman. 

Thank you for all the attention and advice. 

--
Michael Green


> On Dec 20, 2018, at 08:21, Roman Penyaev  wrote:
> 
>> On 2018-12-19 22:01, Marc Roos wrote:
>> I would be interested in learning about the performance increase it has
>> compared to 10Gbit. I have the ConnectX-3 Pro, but I am not using RDMA
>> because support is not available by default.
> 
> Not too much. The following is a comparison on the latest master using the
> fio engine, which measures bare ceph messenger performance (no disk IO):
> https://github.com/ceph/ceph/pull/24678
> 
> 
> Mellanox MT27710 Family [ConnectX-4 Lx] 25gb/s:
> 
> 
> bs    iodepth=8, async+posix               iodepth=8, async+rdma
> ----  -----------------------------------  -----------------------------------
>   4k  IOPS=30.0k  BW=121MiB/s    0.257ms   IOPS=47.9k  BW=187MiB/s    0.166ms
>   8k  IOPS=30.8k  BW=240MiB/s    0.259ms   IOPS=46.3k  BW=362MiB/s    0.172ms
>  16k  IOPS=25.1k  BW=392MiB/s    0.318ms   IOPS=45.2k  BW=706MiB/s    0.176ms
>  32k  IOPS=23.1k  BW=722MiB/s    0.345ms   IOPS=37.5k  BW=1173MiB/s   0.212ms
>  64k  IOPS=18.0k  BW=1187MiB/s   0.420ms   IOPS=41.0k  BW=2624MiB/s   0.189ms
> 128k  IOPS=12.1k  BW=1518MiB/s   0.657ms   IOPS=20.9k  BW=2613MiB/s   0.381ms
> 256k  IOPS=3530   BW=883MiB/s    2.265ms   IOPS=4624   BW=1156MiB/s   1.729ms
> 512k  IOPS=2084   BW=1042MiB/s   3.387ms   IOPS=2406   BW=1203MiB/s   3.32ms
>   1m  IOPS=1119   BW=1119MiB/s   7.145ms   IOPS=1277   BW=1277MiB/s   6.26ms
>   2m  IOPS=551    BW=1101MiB/s   14.51ms   IOPS=631    BW=1263MiB/s   12.66ms
>   4m  IOPS=272    BW=1085MiB/s   29.45ms   IOPS=318    BW=1268MiB/s   25.17ms
> 
> 
> 
> bs    iodepth=128, async+posix             iodepth=128, async+rdma
> ----  -----------------------------------  -----------------------------------
>   4k  IOPS=75.9k  BW=297MiB/s    1.683ms   IOPS=83.4k  BW=326MiB/s    1.535ms
>   8k  IOPS=64.3k  BW=502MiB/s    1.989ms   IOPS=70.3k  BW=549MiB/s    1.819ms
>  16k  IOPS=53.9k  BW=841MiB/s    2.376ms   IOPS=57.8k  BW=903MiB/s    2.214ms
>  32k  IOPS=42.2k  BW=1318MiB/s   3.034ms   IOPS=59.4k  BW=1855MiB/s   2.154ms
>  64k  IOPS=30.0k  BW=1934MiB/s   4.135ms   IOPS=42.3k  BW=2645MiB/s   3.023ms
> 128k  IOPS=18.1k  BW=2268MiB/s   7.052ms   IOPS=21.2k  BW=2651MiB/s   6.031ms
> 256k  IOPS=5186   BW=1294MiB/s   24.71ms   IOPS=5253   BW=1312MiB/s   24.39ms
> 512k  IOPS=2897   BW=1444MiB/s   44.19ms   IOPS=2944   BW=1469MiB/s   43.48ms
>   1m  IOPS=1306   BW=1297MiB/s   97.98ms   IOPS=1421   BW=1415MiB/s   90.27ms
>   2m  IOPS=612    BW=1199MiB/s   208.6ms   IOPS=862    BW=1705MiB/s   148.9ms
>   4m  IOPS=316    BW=1235MiB/s   409.1ms   IOPS=416    BW=1664MiB/s   307.4ms
> 
> 
> 1. As you can see, there is no big difference between posix and rdma.
> 
> 2. Even though a 25Gb/s card is used, we barely reach 20Gb/s.  I also have
>    results for 100Gb/s qlogic cards: no difference, because the network is
>    not the bottleneck.  This is especially visible on loads with a bigger
>    iodepth: bandwidth does not change significantly, so even if you increase
>    the number of in-flight requests, you hit the limit of how fast those
>    requests can be processed.
> 
> 3. Keep in mind this is only messenger performance, so on real ceph loads you
>    will get less, because the whole IO stack is involved.
> 
> 
> --
> Roman
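
For reference, switching between the two messengers compared above is a
ceph.conf change. A minimal sketch follows; it assumes a build with RDMA
support, option names as shipped in recent releases, and the device name
mlx5_0 is a placeholder for the actual RoCE NIC:

    [global]
    # default is async+posix; async+rdma selects the RDMA transport
    ms_type = async+rdma
    # placeholder for the RDMA device to bind (see 'ibv_devices')
    ms_async_rdma_device_name = mlx5_0

The daemons also need enough locked memory for RDMA buffer registration
(e.g. LimitMEMLOCK=infinity in their systemd units), otherwise they tend to
fail at startup.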


Re: [ceph-users] cephfs file block size: must it be so big?

2018-12-21 Thread Gregory Farnum
On Fri, Dec 14, 2018 at 6:44 PM Bryan Henderson 
wrote:

> > Going back through the logs though it looks like the main reason we do a
> > 4MiB block size is so that we have a chance of reporting actual cluster
> > sizes to 32-bit systems,
>
> I believe you're talking about a different block size (there are so many of
> them).
>
> The 'statvfs' system call (the essence of a 'df' command) can return its
> space sizes in any units it wants, and tells you that unit.  The unit has
> variously been called block size and fragment size.  In Cephfs, it is
> hardcoded as 4 MiB so that 32 bit fields can represent large storage sizes.
> I'm not aware that anyone attempts to use that value for anything but
> interpreting statvfs results.  Not saying they don't, though.
>
> What I'm looking at, in contrast, is the block size returned by a 'stat'
> system call on a particular file.  In Cephfs, it's the stripe unit size for
> the file, which is an aspect of the file's layout.  In the default layout,
> stripe unit size is 4 MiB.
>

You are of course correct; sorry for the confusion.
It looks like this was introduced in (user space) commit
0457783f6eb0c41951b6d56a568eccaeccec8e6d, which swapped it from the
previous hard-coded 4096, probably in the expectation that there might be
still-small stripe units that were nevertheless useful to do IO in terms of.

You might want to try and be more sophisticated than just having a mount
option to override the reported block size — perhaps forcing the reported
size within some reasonable limits, but trying to keep some relationship
between it and the stripe size? If someone deploys an erasure-coded pool
under CephFS they definitely want to be doing IO in the stripe size if
possible, rather than 4 or 8KiB.
-Greg
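
For anyone who wants to see the two values side by side, they can be
inspected from a shell; the paths below are placeholders, and
ceph.file.layout.stripe_unit is the CephFS virtual xattr holding the stripe
unit:

    # filesystem-wide unit reported through statvfs (what 'df' uses)
    stat -f -c 'fundamental block size: %S bytes' /mnt/cephfs
    # per-file st_blksize reported through stat()
    stat -c 'st_blksize: %o bytes' /mnt/cephfs/somefile
    # the stripe unit from the file's layout
    getfattr -n ceph.file.layout.stripe_unit /mnt/cephfs/somefile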


Re: [ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Brad Hubbard
Can you provide the complete OOM message from the dmesg log?
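
For reference, something like this usually pulls out the full kernel OOM
report, including the per-process memory table (run on an affected node):

    dmesg -T | grep -i -B 5 -A 40 'out of memory'
    # or, if the node keeps a persistent journal:
    journalctl -k | grep -i -B 5 -A 40 'oom'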

On Sat, Dec 22, 2018 at 7:53 AM Pardhiv Karri  wrote:
>
>
> Thank You for the quick response Dyweni!
>
> We are using FileStore as this cluster is upgraded from 
> Hammer-->Jewel-->Luminous 12.2.8. 16x2TB HDD per node for all nodes. R730xd 
> has 128GB and R740xd has 96GB of RAM. Everything else is the same.
>
> Thanks,
> Pardhiv Karri
>
> On Fri, Dec 21, 2018 at 1:43 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> 
> wrote:
>>
>> Hi,
>>
>>
>> You could be running out of memory due to the default Bluestore cache sizes.
>>
>>
>> How many disks/OSDs in the R730xd versus the R740xd?  How much memory in 
>> each server type?  How many are HDD versus SSD?  Are you running Bluestore?
>>
>>
>> OSD's in Luminous, which run Bluestore, allocate memory to use as a "cache", 
>> since the kernel-provided page-cache is not available to Bluestore.  
>> Bluestore, by default, will use 1GB of memory for each HDD, and 3GB of 
>> memory for each SSD.  OSD's do not allocate all that memory up front, but 
>> grow into it as it is used.  This cache is in addition to any other memory 
>> the OSD uses.
>>
>>
>> Check out the bluestore_cache_* values (these are specified in bytes) in the 
>> manual cache sizing section of the docs 
>> (http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
>>Note that the automatic cache sizing feature wasn't added until 12.2.9.
>>
>>
>>
>> As an example, I have OSD's running on 32bit/armhf nodes.  These nodes have 
>> 2GB of memory.  I run 1 Bluestore OSD on each node.  In my ceph.conf file, I 
>> have 'bluestore cache size = 536870912' and 'bluestore cache kv max = 
>> 268435456'.  I see aprox 1.35-1.4 GB used by each OSD.
>>
>>
>>
>>
>> On 2018-12-21 15:19, Pardhiv Karri wrote:
>>
>> Hi,
>>
>> We have a luminous cluster which was upgraded from Hammer --> Jewel --> 
>> Luminous 12.2.8 recently. Post upgrade we are seeing issue with a few nodes 
>> where they are running out of memory and dying. In the logs we are seeing 
>> OOM killer. We don't have this issue before upgrade. The only difference is 
>> the nodes without any issue are R730xd and the ones with the memory leak are 
>> R740xd. The hardware vendor don't see anything wrong with the hardware. From 
>> Ceph end we are not seeing any issue when it comes to running the cluster, 
>> only issue is with memory leak. Right now we are actively rebooting the 
>> nodes in timely manner to avoid crashes. One R740xd node we set all the OSDs 
>> to 0.0 and there is no memory leak there. Any pointers to fix the issue 
>> would be helpful.
>>
>> Thanks,
>> Pardhiv Karri
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Pardhiv Karri
> "Rise and Rise again until LAMBS become LIONS"
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad


Re: [ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Pardhiv Karri
Thank you for the quick response, Dyweni!

We are using FileStore, as this cluster was upgraded from
Hammer-->Jewel-->Luminous 12.2.8. There are 16x 2TB HDDs per node on all
nodes. The R730xd has 128GB and the R740xd has 96GB of RAM. Everything else
is the same.

Thanks,
Pardhiv Karri

On Fri, Dec 21, 2018 at 1:43 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com>
wrote:

> Hi,
>
>
> You could be running out of memory due to the default Bluestore cache
> sizes.
>
>
> How many disks/OSDs in the R730xd versus the R740xd?  How much memory in
> each server type?  How many are HDD versus SSD?  Are you running Bluestore?
>
>
> OSD's in Luminous, which run Bluestore, allocate memory to use as a
> "cache", since the kernel-provided page-cache is not available to
> Bluestore.  Bluestore, by default, will use 1GB of memory for each HDD, and
> 3GB of memory for each SSD.  OSD's do not allocate all that memory up
> front, but grow into it as it is used.  This cache is in addition to any
> other memory the OSD uses.
>
>
> Check out the bluestore_cache_* values (these are specified in bytes) in
> the manual cache sizing section of the docs (
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
>  Note that the automatic cache sizing feature wasn't added until 12.2.9.
>
>
>
> As an example, I have OSD's running on 32bit/armhf nodes.  These nodes
> have 2GB of memory.  I run 1 Bluestore OSD on each node.  In my ceph.conf
> file, I have 'bluestore cache size = 536870912' and 'bluestore cache kv max
> = 268435456'.  I see aprox 1.35-1.4 GB used by each OSD.
>
>
>
>
> On 2018-12-21 15:19, Pardhiv Karri wrote:
>
> Hi,
>
> We have a luminous cluster which was upgraded from Hammer --> Jewel -->
> Luminous 12.2.8 recently. Post upgrade we are seeing issue with a few nodes
> where they are running out of memory and dying. In the logs we are seeing
> OOM killer. We don't have this issue before upgrade. The only difference is
> the nodes without any issue are R730xd and the ones with the memory leak
> are R740xd. The hardware vendor don't see anything wrong with the hardware.
> From Ceph end we are not seeing any issue when it comes to running the
> cluster, only issue is with memory leak. Right now we are actively
> rebooting the nodes in timely manner to avoid crashes. One R740xd node we
> set all the OSDs to 0.0 and there is no memory leak there. Any pointers to
> fix the issue would be helpful.
>
> Thanks,
> *Pardhiv Karri*
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"


Re: [ceph-users] Bluestore nvme DB/WAL size

2018-12-21 Thread Anthony D'Atri
> It'll cause problems if your only NVMe drive dies - you'll lose all the DB
> partitions and all the OSDs are going to fail


The severity of this depends a lot on the size of the cluster.  If there are 
only, say, 4 nodes total, for sure the loss of a quarter of the OSDs will be 
somewhere between painful and fatal.  Especially if the subtree limit does not 
forestall rebalancing, and if EC is being used vs replication.  From a pain 
angle, though, this is no worse than if the server itself smokes.

It's easy to say "don't do that" but sometimes one doesn't have a choice:

* Unit economics can confound provisioning of larger/more external metadata 
devices.  I'm sure Vlad isn't using spinners because he hates SSDs.

* Devices have to go somewhere.  It's not uncommon to have a server using 2 
PCIe slots for NICs (1) and another for an HBA, leaving as few as 1 or 0 free.  
Sometimes the potential for a second PCI riser is canceled by the need to 
provision a rear drive cage for OS/boot drives to maximize front-panel bay 
availability.

* Cannibalizing one or more front drive bays for metadata drives can be 
problematic:
- Usable cluster capacity is decreased, along with unit economics
- Dogfood or draconian corporate policy (Herbert! Herbert!) can prohibit this.  
In the past I've personally been prohibited from the obvious choice to use a 
simple open-market LFF to SFF adapter because it wasn't officially "supported" 
and would use components without a corporate SKU.

The 4% guidance was 1% until not all that long ago.  Guidance on calculating 
adequate sizing based on application and workload would be NTH.  I've been told 
that an object storage (RGW) use case can readily get away with less because 
L2/L3/etc are both rarely accessed and the first to be overflowed onto slower 
storage.  And that block (RBD) workloads have different access patterns that 
are more impacted by overflow of higher levels.  As RBD pools increasingly are 
deployed on SSD/NVMe devices, the case for colocating their metadata is strong, 
and obviates having to worry about sizing before deployment.
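
For the numbers quoted in this thread (16x 10TB HDDs and a single 1.5TB NVMe
per node), the gap between guidance and hardware works out roughly as:

    4% guidance:  0.04 x 10 TB = 400 GB of DB per OSD  ->  16 x 400 GB = 6.4 TB of NVMe
    1% guidance:  0.01 x 10 TB = 100 GB of DB per OSD  ->  16 x 100 GB = 1.6 TB of NVMe
    available:    1.5 TB / 16  = ~90 GB per OSD, i.e. roughly 0.9% of each 10 TB HDD

So the proposed layout sits almost exactly on the old 1% guidance and well
short of the current 4%.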













(1) One of many reasons to seriously consider not having a separate replication 
network







Re: [ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Dyweni - Ceph-Users
Hi, 

You could be running out of memory due to the default Bluestore cache
sizes. 

How many disks/OSDs in the R730xd versus the R740xd?  How much memory in
each server type?  How many are HDD versus SSD?  Are you running
Bluestore? 

OSD's in Luminous, which run Bluestore, allocate memory to use as a
"cache", since the kernel-provided page-cache is not available to
Bluestore.  Bluestore, by default, will use 1GB of memory for each HDD,
and 3GB of memory for each SSD.  OSD's do not allocate all that memory
up front, but grow into it as it is used.  This cache is in addition to
any other memory the OSD uses. 

Check out the bluestore_cache_* values (these are specified in bytes) in
the manual cache sizing section of the docs
(http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
  Note that the automatic cache sizing feature wasn't added until
12.2.9. 

As an example, I have OSD's running on 32bit/armhf nodes.  These nodes
have 2GB of memory.  I run 1 Bluestore OSD on each node.  In my
ceph.conf file, I have 'bluestore cache size = 536870912' and 'bluestore
cache kv max = 268435456'.  I see approx 1.35-1.4 GB used by each OSD. 
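
If in doubt about what a running OSD is actually configured with and using,
the admin socket shows both the cache options and the live memory pools
(osd.0 below is a placeholder id):

    ceph daemon osd.0 config get bluestore_cache_size
    ceph daemon osd.0 config get bluestore_cache_size_hdd
    ceph daemon osd.0 dump_mempools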

On 2018-12-21 15:19, Pardhiv Karri wrote:

> Hi, 
> 
> We have a luminous cluster which was upgraded from Hammer --> Jewel --> 
> Luminous 12.2.8 recently. Post upgrade we are seeing issue with a few nodes 
> where they are running out of memory and dying. In the logs we are seeing OOM 
> killer. We don't have this issue before upgrade. The only difference is the 
> nodes without any issue are R730xd and the ones with the memory leak are 
> R740xd. The hardware vendor don't see anything wrong with the hardware. From 
> Ceph end we are not seeing any issue when it comes to running the cluster, 
> only issue is with memory leak. Right now we are actively rebooting the nodes 
> in timely manner to avoid crashes. One R740xd node we set all the OSDs to 0.0 
> and there is no memory leak there. Any pointers to fix the issue would be 
> helpful. 
> 
> Thanks, 
> PARDHIV KARRI 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Pardhiv Karri
Hi,

We have a Luminous cluster which was upgraded from Hammer --> Jewel -->
Luminous 12.2.8 recently. Post-upgrade we are seeing an issue with a few
nodes where they are running out of memory and dying. In the logs we are
seeing the OOM killer. We did not have this issue before the upgrade. The
only difference is that the nodes without any issue are R730xd and the ones
with the memory leak are R740xd. The hardware vendor doesn't see anything
wrong with the hardware. From the Ceph end we are not seeing any issue when
it comes to running the cluster; the only issue is the memory leak. Right
now we are actively rebooting the nodes in a timely manner to avoid crashes.
On one R740xd node we set all the OSD weights to 0.0 and there is no memory
leak there. Any pointers to fix the issue would be helpful.

Thanks,
*Pardhiv Karri*


Re: [ceph-users] Ceph Cluster to OSD Utilization not in Sync

2018-12-21 Thread Pardhiv Karri
Thank you, Dyweni, for the quick response. We have two Hammer clusters which
are due for an upgrade to Luminous next month, and one Luminous 12.2.8
cluster. We will try this on the Luminous cluster and, if it works, apply
the same approach once the Hammer clusters are upgraded, rather than
adjusting the weights.

Thanks,
Pardhiv Karri

On Fri, Dec 21, 2018 at 1:05 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com>
wrote:

> Hi,
>
>
> If you are running Ceph Luminous or later, use the Ceph Manager Daemon's
> Balancer module.  (http://docs.ceph.com/docs/luminous/mgr/balancer/).
>
>
> Otherwise, tweak the OSD weights (not the OSD CRUSH weights) until you
> achieve uniformity.  (You should be able to get under 1 STDDEV).  I would
> adjust in small amounts to not overload your cluster.
>
>
> Example:
>
> ceph osd reweight osd.X  y.yyy
>
>
>
>
> On 2018-12-21 14:56, Pardhiv Karri wrote:
>
> Hi,
>
> We have Ceph clusters which are greater than 1PB. We are using tree
> algorithm. The issue is with the data placement. If the cluster utilization
> percentage is at 65% then some of the OSDs are already above 87%. We had to
> change the near_full ratio to 0.90 to circumvent warnings and to get back
> the Health to OK state.
>
> How can we keep the OSDs utilization to be in sync with cluster
> utilization (both percentages to be close enough) as we want to utilize the
> cluster to the max (above 80%) without unnecessarily adding too many
> nodes/osd's. Right now we are losing close to 400TB of the disk space
> unused as some OSDs are above 87% and some are below 50%. If the above 87%
> OSDs reach 95% then the cluster will have issues. What is the best way to
> mitigate this issue?
>
> Thanks,
>
> *Pardhiv Karri*
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"


Re: [ceph-users] Ceph Cluster to OSD Utilization not in Sync

2018-12-21 Thread Dyweni - Ceph-Users
Hi, 

If you are running Ceph Luminous or later, use the Ceph Manager Daemon's
Balancer module.  (http://docs.ceph.com/docs/luminous/mgr/balancer/). 

Otherwise, tweak the OSD weights (not the OSD CRUSH weights) until you
achieve uniformity.  (You should be able to get under 1 STDDEV).  I
would adjust in small amounts to not overload your cluster. 

Example: 

ceph osd reweight osd.X  y.yyy 
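
On Luminous, enabling the balancer looks roughly like this (crush-compat
mode shown; upmap mode additionally requires all clients to be Luminous or
newer):

    ceph mgr module enable balancer
    ceph balancer mode crush-compat
    ceph balancer on
    ceph balancer status

    # watch the per-OSD spread and the STDDEV summary while it works
    ceph osd df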

On 2018-12-21 14:56, Pardhiv Karri wrote:

> Hi, 
> 
> We have Ceph clusters which are greater than 1PB. We are using tree 
> algorithm. The issue is with the data placement. If the cluster utilization 
> percentage is at 65% then some of the OSDs are already above 87%. We had to 
> change the near_full ratio to 0.90 to circumvent warnings and to get back the 
> Health to OK state. 
> 
> How can we keep the OSDs utilization to be in sync with cluster utilization 
> (both percentages to be close enough) as we want to utilize the cluster to 
> the max (above 80%) without unnecessarily adding too many nodes/osd's. Right 
> now we are losing close to 400TB of the disk space unused as some OSDs are 
> above 87% and some are below 50%. If the above 87% OSDs reach 95% then the 
> cluster will have issues. What is the best way to mitigate this issue? 
> 
> Thanks, 
> Pardhiv Karri
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Cluster to OSD Utilization not in Sync

2018-12-21 Thread Pardhiv Karri
Hi,

We have Ceph clusters which are greater than 1PB. We are using the tree
algorithm. The issue is with data placement: when the cluster utilization
is at 65%, some of the OSDs are already above 87%. We had to change the
near_full ratio to 0.90 to circumvent warnings and get the health back to
the OK state.

How can we keep OSD utilization in sync with cluster utilization (both
percentages close to each other)? We want to utilize the cluster to the max
(above 80%) without unnecessarily adding too many nodes/OSDs. Right now we
are losing close to 400TB of disk space unused, because some OSDs are above
87% while others are below 50%. If the OSDs above 87% reach 95%, the cluster
will have issues. What is the best way to mitigate this issue?

Thanks,

*Pardhiv Karri*


[ceph-users] Your email to ceph-uses mailing list: Signature check failures.

2018-12-21 Thread Dyweni - Ceph-Users

Hi Cary,

I ran across your email on the ceph-users mailing list 'Signature check 
failures.'.


I've just run across the same issue on my end.  Also Gentoo user here.

Running Ceph 12.2.5... 32bit/armhf and 64bit/x86_64.


Was your environment mixed or strictly just x86_64?



What is interesting is that my 32bit/armhf (built with USE="ssl -nss") 
OSDs have no problem talking to themselves, to my 64bit/x86_64 (built 
with USE="-ssl nss") OSDs, or to my 64bit/x86_64 (built with USE="-ssl nss") 
clients.



Trying to build new 64bit/x86_64 (built with USE="ssl -nss") OSDs and 
getting this same error with a simple 'rbd ls -l'.



OpenSSL version is 1.0.2p.   Do you remember which version of OpenSSL 
you were building against?  'genlop -e openssl' will show you.



The locally calculated signature usually looks really short, so I'm 
wondering if we're hitting some kind of variable size issue... maybe an 
overflow too?



Would appreciate any insight you could give.

Thanks!

Dyweni





Re: [ceph-users] Possible data damage: 1 pg inconsistent

2018-12-21 Thread Frank Ritchie
Christoph, do you have any links to the bug?

On Fri, Dec 21, 2018 at 11:07 AM Christoph Adomeit <
christoph.adom...@gatworks.de> wrote:

> Hi,
>
> same here but also for pgs in cephfs pools.
>
> As far as I know there is a known bug that under memory pressure some
> reads return zero
> and this will lead to the error message.
>
> I have set nodeep-scrub and i am waiting for 12.2.11.
>
> Thanks
>   Christoph
>
> On Fri, Dec 21, 2018 at 03:23:21PM +0100, Hervé Ballans wrote:
> > Hi Frank,
> >
> > I encounter exactly the same issue with the same disks than yours. Every
> > day, after a batch of deep scrubbing operation, ther are generally
> between 1
> > and 3 inconsistent pgs, and that, on different OSDs.
> >
> > It could confirm a problem on these disks, but :
> >
> > - it concerns only the pgs of the rbd pool, not those of cephfs pools
> (the
> > same disk model is used)
> >
> > - I encounter this when I was running 12.2.5, not when I upgraded in
> 12.2.8
> > but the problem appears again after upgrade in 12.2.10
> >
> > - On my side, smartctl and dmesg do not show any media error, so I'm
> pretty
> > sure that physical media is not concerned...
> >
> > Small precision: each disk is configured with RAID0 on a PERC740P, is
> this
> > also the case for you or are your disks in JBOD mode ?
> >
> > Another question: in your case, the OSD who is involved in the
> inconsistent
> > pgs is it always the same one or is it a new one every time ?
> >
> > For information, currently, the manually 'ceph pg repair' command works
> well
> > each time...
> >
> > Context: Luminous 12.2.10, Bluestore OSD with data block on SATA disks
> and
> > WAL/DB on NVMe, rbd configuration replica 3/2
> >
> > Cheers,
> > rv
> >
> > Few outputs:
> >
> > $ sudo ceph -s
> >   cluster:
> > id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
> > health: HEALTH_ERR
> > 3 scrub errors
> > Possible data damage: 3 pgs inconsistent
> >
> >   services:
> > mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
> > mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
> > mds: cephfs_home-2/2/2 up
> > {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
> > osd: 126 osds: 126 up, 126 in
> >
> >   data:
> > pools:   3 pools, 4224 pgs
> > objects: 23.35M objects, 20.9TiB
> > usage:   64.9TiB used, 136TiB / 201TiB avail
> > pgs: 4221 active+clean
> >  3active+clean+inconsistent
> >
> >   io:
> > client:   2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr
> >
> > $ sudo ceph health detail
> > HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
> > OSD_SCRUB_ERRORS 3 scrub errors
> > PG_DAMAGED Possible data damage: 3 pgs inconsistent
> > pg 9.27 is active+clean+inconsistent, acting [78,107,96]
> > pg 9.260 is active+clean+inconsistent, acting [84,113,62]
> > pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]
> > $ sudo rados list-inconsistent-obj 9.27 --format=json-prettyrados
> > list-inconsistent-obj 9.27 --format=json-pretty |grep error
> > "errors": [],
> > "union_shard_errors": [
> > "read_error"
> > "errors": [
> > "read_error"
> > "errors": [],
> > "errors": [],
> > $ sudo rados list-inconsistent-obj 9.260 --format=json-prettyrados
> > list-inconsistent-obj 9.260 --format=json-pretty |grep error
> > "errors": [],
> > "union_shard_errors": [
> > "read_error"
> > "errors": [],
> > "errors": [],
> > "errors": [
> > "read_error"
> > $ sudo rados list-inconsistent-obj 9.6b9 --format=json-prettyrados
> > list-inconsistent-obj 9.6b9 --format=json-pretty |grep error
> > "errors": [],
> > "union_shard_errors": [
> > "read_error"
> > "errors": [
> > "read_error"
> > "errors": [],
> > "errors": [],
> > $ sudo ceph pg repair 9.27
> > instructing pg 9.27 on osd.78 to repair
> > $ sudo ceph pg repair 9.260
> > instructing pg 9.260 on osd.84 to repair
> > $ sudo ceph pg repair 9.6b9
> > instructing pg 9.6b9 on osd.79 to repair
> > $ sudo ceph -s
> >   cluster:
> > id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
> > health: HEALTH_OK
> >
> >   services:
> > mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
> > mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
> > mds: cephfs_home-2/2/2 up
> > {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
> > osd: 126 osds: 126 up, 126 in
> >
> >   data:
> > pools:   3 pools, 4224 pgs
> > objects: 23.35M objects, 20.9TiB
> > usage:   64.9TiB used, 136TiB / 201TiB avail
> > pgs: 4224 active+clean
> >
> >   io:
> > 

Re: [ceph-users] Possible data damage: 1 pg inconsistent

2018-12-21 Thread Christoph Adomeit
Hi,

same here, but also for pgs in cephfs pools.

As far as I know there is a known bug where, under memory pressure, some
reads return zero, and this leads to the error message.

I have set nodeep-scrub and I am waiting for 12.2.11.
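
For reference, that flag is set and cleared cluster-wide with:

    ceph osd set nodeep-scrub
    ceph osd unset nodeep-scrub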

Thanks
  Christoph

On Fri, Dec 21, 2018 at 03:23:21PM +0100, Hervé Ballans wrote:
> Hi Frank,
> 
> I encounter exactly the same issue with the same disks than yours. Every
> day, after a batch of deep scrubbing operation, ther are generally between 1
> and 3 inconsistent pgs, and that, on different OSDs.
> 
> It could confirm a problem on these disks, but :
> 
> - it concerns only the pgs of the rbd pool, not those of cephfs pools (the
> same disk model is used)
> 
> - I encounter this when I was running 12.2.5, not when I upgraded in 12.2.8
> but the problem appears again after upgrade in 12.2.10
> 
> - On my side, smartctl and dmesg do not show any media error, so I'm pretty
> sure that physical media is not concerned...
> 
> Small precision: each disk is configured with RAID0 on a PERC740P, is this
> also the case for you or are your disks in JBOD mode ?
> 
> Another question: in your case, the OSD who is involved in the inconsistent
> pgs is it always the same one or is it a new one every time ?
> 
> For information, currently, the manually 'ceph pg repair' command works well
> each time...
> 
> Context: Luminous 12.2.10, Bluestore OSD with data block on SATA disks and
> WAL/DB on NVMe, rbd configuration replica 3/2
> 
> Cheers,
> rv
> 
> Few outputs:
> 
> $ sudo ceph -s
>   cluster:
>     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
>     health: HEALTH_ERR
>     3 scrub errors
>     Possible data damage: 3 pgs inconsistent
> 
>   services:
>     mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
>     mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
>     mds: cephfs_home-2/2/2 up
> {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
>     osd: 126 osds: 126 up, 126 in
> 
>   data:
>     pools:   3 pools, 4224 pgs
>     objects: 23.35M objects, 20.9TiB
>     usage:   64.9TiB used, 136TiB / 201TiB avail
>     pgs: 4221 active+clean
>  3    active+clean+inconsistent
> 
>   io:
>     client:   2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr
> 
> $ sudo ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 3 pgs inconsistent
>     pg 9.27 is active+clean+inconsistent, acting [78,107,96]
>     pg 9.260 is active+clean+inconsistent, acting [84,113,62]
>     pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]
> $ sudo rados list-inconsistent-obj 9.27 --format=json-prettyrados
> list-inconsistent-obj 9.27 --format=json-pretty |grep error
>     "errors": [],
>     "union_shard_errors": [
>     "read_error"
>     "errors": [
>     "read_error"
>     "errors": [],
>     "errors": [],
> $ sudo rados list-inconsistent-obj 9.260 --format=json-prettyrados
> list-inconsistent-obj 9.260 --format=json-pretty |grep error
>     "errors": [],
>     "union_shard_errors": [
>     "read_error"
>     "errors": [],
>     "errors": [],
>     "errors": [
>     "read_error"
> $ sudo rados list-inconsistent-obj 9.6b9 --format=json-prettyrados
> list-inconsistent-obj 9.6b9 --format=json-pretty |grep error
>     "errors": [],
>     "union_shard_errors": [
>     "read_error"
>     "errors": [
>     "read_error"
>     "errors": [],
>     "errors": [],
> $ sudo ceph pg repair 9.27
> instructing pg 9.27 on osd.78 to repair
> $ sudo ceph pg repair 9.260
> instructing pg 9.260 on osd.84 to repair
> $ sudo ceph pg repair 9.6b9
> instructing pg 9.6b9 on osd.79 to repair
> $ sudo ceph -s
>   cluster:
>     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
>     health: HEALTH_OK
> 
>   services:
>     mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
>     mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
>     mds: cephfs_home-2/2/2 up
> {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
>     osd: 126 osds: 126 up, 126 in
> 
>   data:
>     pools:   3 pools, 4224 pgs
>     objects: 23.35M objects, 20.9TiB
>     usage:   64.9TiB used, 136TiB / 201TiB avail
>     pgs: 4224 active+clean
> 
>   io:
>     client:   195KiB/s rd, 7.19MiB/s wr, 17op/s rd, 127op/s wr
> 
> 
> 
> Le 19/12/2018 à 04:48, Frank Ritchie a écrit :
> >Hi all,
> >
> >I have been receiving alerts for:
> >
> >Possible data damage: 1 pg inconsistent
> >
> >almost daily for a few weeks now. When I check:
> >
> >rados list-inconsistent-obj $PG --format=json-pretty
> >
> >I will always see a read_error. When I run a deep scrub 

[ceph-users] CephFS MDS optimal setup on Google Cloud

2018-12-21 Thread Mahmoud Ismail
Hello,

I'm doing benchmarks for metadata operations on CephFS, HDFS, and HopsFS on
Google Cloud. In my current setup, I'm using 32 vCPU machines with 29 GB of
memory, and I have 1 MDS, 1 MON and 3 OSDs. The MDS and the MON are
co-located on one VM, while each of the OSDs is on a separate VM with 1 SSD
disk attached. I'm using the default configuration for the MDS and OSDs.

I'm running 300 clients on 10 machines (16 vCPU each); each client creates a
CephFileSystem using the CephFS Hadoop plugin, then writes empty files for
30 seconds followed by reading the empty files for another 30 seconds. The
aggregated throughput is around 2000 file create operations/sec and
1 file read operations/sec. However, the MDS is not fully utilizing the
32 cores on the machine; is there any configuration I should consider to
fully utilize the machine?

Also, I noticed that running more than 20-30 clients (on different threads)
per machine degrades the aggregated read throughput. Is there a limitation
in CephFileSystem and libceph on the number of clients created per machine?

Another issue: are the MDS operations single-threaded, as pointed out here
"https://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark"?
Regarding the MDS global lock, is it a single lock per MDS or a global
distributed lock for all MDSs?
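
For context on the utilization question: a single MDS rank is largely
serialized internally, so beyond per-daemon tuning the usual way to use more
cores (or machines) is to add active MDS ranks, which Luminous supports. A
minimal sketch, assuming the filesystem is named "cephfs" and standby
daemons are available:

    ceph fs set cephfs max_mds 2
    ceph fs status cephfs
    # depending on the exact release, multiple active ranks may first need to
    # be allowed explicitly, e.g. 'ceph fs set cephfs allow_multimds true'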

Regards,
Mahmoud


Re: [ceph-users] Possible data damage: 1 pg inconsistent

2018-12-21 Thread Hervé Ballans

Hi Frank,

I encounter exactly the same issue, with the same disks as yours. Every 
day, after a batch of deep scrubbing operations, there are generally 
between 1 and 3 inconsistent pgs, each time on different OSDs.


This could point to a problem with these disks, but:

- it concerns only the pgs of the rbd pool, not those of the cephfs pools 
(the same disk model is used)


- I encountered this when I was running 12.2.5, not after I upgraded to 
12.2.8, but the problem appeared again after upgrading to 12.2.10


- on my side, smartctl and dmesg do not show any media error, so I'm 
pretty sure the physical media is not involved...


One precision: each disk is configured as RAID0 on a PERC740P; is this 
also the case for you, or are your disks in JBOD mode?


Another question: in your case, is the OSD involved in the inconsistent 
pgs always the same one, or is it a new one every time?


For information, currently the manual 'ceph pg repair' command works 
well each time...


Context: Luminous 12.2.10, Bluestore OSD with data block on SATA disks 
and WAL/DB on NVMe, rbd configuration replica 3/2


Cheers,
rv

Few outputs:

$ sudo ceph -s
  cluster:
    id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_ERR
    3 scrub errors
    Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up 
{0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby

    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs: 4221 active+clean
 3    active+clean+inconsistent

  io:
    client:   2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr

$ sudo ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
    pg 9.27 is active+clean+inconsistent, acting [78,107,96]
    pg 9.260 is active+clean+inconsistent, acting [84,113,62]
    pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]
$ sudo rados list-inconsistent-obj 9.27 --format=json-pretty | grep error

    "errors": [],
    "union_shard_errors": [
    "read_error"
    "errors": [
    "read_error"
    "errors": [],
    "errors": [],
$ sudo rados list-inconsistent-obj 9.260 --format=json-pretty | grep error

    "errors": [],
    "union_shard_errors": [
    "read_error"
    "errors": [],
    "errors": [],
    "errors": [
    "read_error"
$ sudo rados list-inconsistent-obj 9.6b9 --format=json-pretty | grep error

    "errors": [],
    "union_shard_errors": [
    "read_error"
    "errors": [
    "read_error"
    "errors": [],
    "errors": [],
$ sudo ceph pg repair 9.27
instructing pg 9.27 on osd.78 to repair
$ sudo ceph pg repair 9.260
instructing pg 9.260 on osd.84 to repair
$ sudo ceph pg repair 9.6b9
instructing pg 9.6b9 on osd.79 to repair
$ sudo ceph -s
  cluster:
    id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up 
{0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby

    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs: 4224 active+clean

  io:
    client:   195KiB/s rd, 7.19MiB/s wr, 17op/s rd, 127op/s wr



Le 19/12/2018 à 04:48, Frank Ritchie a écrit :

Hi all,

I have been receiving alerts for:

Possible data damage: 1 pg inconsistent

almost daily for a few weeks now. When I check:

rados list-inconsistent-obj $PG --format=json-pretty

I will always see a read_error. When I run a deep scrub on the PG I 
will see:


head candidate had a read error

When I check dmesg on the osd node I see:

blk_update_request: critical medium error, dev sdX, sector 123

I will also see a few uncorrected read errors in smartctl.

Info:
Ceph: ceph version 12.2.4-30.el7cp
OSD: Toshiba 1.8TB SAS 10K
120 OSDs total

Has anyone else seen these alerts occur almost daily? Can the errors 
possibly be due to deep scrubbing too aggressively?


I realize these errors indicate potential failing drives but I can't 
replace a drive daily.


thx
Frank
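
On the last question: deep-scrub pressure can be dialed back without
disabling it entirely. A sketch of the usual ceph.conf knobs (values are
illustrative, not recommendations):

    osd max scrubs = 1                   # concurrent scrubs per OSD (default 1)
    osd scrub sleep = 0.1                # pause between scrub chunks, in seconds
    osd deep scrub interval = 1209600    # deep-scrub every 14 days instead of 7
    osd scrub load threshold = 0.5       # skip scheduled scrubs above this loadavg

Note, though, that a medium error reported by the disk itself
(blk_update_request in dmesg, uncorrected errors in SMART) points at the
drive; the scrub is only what surfaces it.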




Re: [ceph-users] Bluestore nvme DB/WAL size

2018-12-21 Thread David C
I'm in a similar situation, currently running Filestore with spinners and
journals on NVMe partitions which are about 1% of the size of the OSD. If I
migrate to Bluestore, I'll still only have that 1% available. Per the docs,
if my block.db device fills up, the metadata is going to spill back onto
the block device, which will incur an understandable performance penalty. The
question is, will there be more of a performance hit in that scenario than
if the block.db were on the spinner and just the WAL were on the NVMe?
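
One way to watch for that spill in practice is the BlueFS counters on the
OSD admin socket; if slow_used_bytes climbs above zero, DB data has
overflowed onto the slow device (osd.0 below is a placeholder id):

    ceph daemon osd.0 perf dump | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'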

On Fri, Dec 21, 2018 at 9:01 AM Janne Johansson  wrote:

> Den tors 20 dec. 2018 kl 22:45 skrev Vladimir Brik
> :
> > Hello
> > I am considering using logical volumes of an NVMe drive as DB or WAL
> > devices for OSDs on spinning disks.
> > The documentation recommends against DB devices smaller than 4% of slow
> > disk size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so
> > dividing it equally will result in each OSD getting ~90GB DB NVMe
> > volume, which is a lot less than 4%. Will this cause problems down the
> road?
>
> Well, apart from the reply you already got on "one nvme fails all the
> HDDs it is WAL/DB for",
> the recommendations are about getting the best out of them, especially
> for the DB I suppose.
>
> If one can size stuff up before, then following recommendations is a
> good choice, but I think
> you should test using it for WALs for instance, and bench it against
> another host with data,
> wal and db on the HDD and see if it helps a lot in your expected use case.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Bluestore nvme DB/WAL size

2018-12-21 Thread Konstantin Shalygin

I am considering using logical volumes of an NVMe drive as DB or WAL
devices for OSDs on spinning disks.

The documentation recommends against DB devices smaller than 4% of slow
disk size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so
dividing it equally will result in each OSD getting ~90GB DB NVMe
volume, which is a lot less than 4%. Will this cause problems down the road?


This host holds 145TB of data on HDDs. You'll be in agony when this NVMe
dies.




k



Re: [ceph-users] Bluestore nvme DB/WAL size

2018-12-21 Thread Janne Johansson
Den tors 20 dec. 2018 kl 22:45 skrev Vladimir Brik
:
> Hello
> I am considering using logical volumes of an NVMe drive as DB or WAL
> devices for OSDs on spinning disks.
> The documentation recommends against DB devices smaller than 4% of slow
> disk size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so
> dividing it equally will result in each OSD getting ~90GB DB NVMe
> volume, which is a lot less than 4%. Will this cause problems down the road?

Well, apart from the reply you already got about "one NVMe failing takes
down all the HDDs it is WAL/DB for", the recommendations are about getting
the best out of them, especially for the DB, I suppose.

If one can size things up beforehand, then following the recommendations is
a good choice, but I think you should test using it for WALs, for instance,
and bench that against another host with data, WAL and DB on the HDD, to
see if it helps a lot in your expected use case.
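
If one wants to bench those two layouts, ceph-volume can build both; the
device names below are placeholders:

    # WAL on the NVMe, DB stays with the data on the HDD
    ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p1

    # everything (data, DB and WAL) on the HDD
    ceph-volume lvm create --bluestore --data /dev/sdb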

-- 
May the most significant bit of your life be positive.