Re: [ceph-users] Osd auth del

2019-12-03 Thread Willem Jan Withagen

On 3-12-2019 11:43, Wido den Hollander wrote:



On 12/3/19 11:40 AM, John Hearns wrote:

I had a fat-fingered moment yesterday.
I typed: ceph auth del osd.3
Where osd.3 is an otherwise healthy little osd.
I have not set noout or down on osd.3 yet.

This is a Nautilus cluster.
ceph health reports everything is OK



Fetch the key from the OSD's datastore on the machine itself. On the OSD 
machine you'll find a file called keyring.


Get that file and import it with the proper caps back into cephx. Then 
all should be fixed!


The magic incantation there would be:

ceph auth add osd.<id> osd 'allow *' mon 'allow rwx' -i <keyring>
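
Spelled out for the osd.3 case from this thread, a sketch assuming the default
data path of /var/lib/ceph/osd/ceph-3 (osd.2 here just stands for any other
healthy OSD to compare caps against):

    # on the OSD host, the keyring sits in the OSD's data directory
    cat /var/lib/ceph/osd/ceph-3/keyring
    # compare caps with a healthy OSD, then re-import the key
    ceph auth get osd.2
    ceph auth add osd.3 osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-3/keyring
    # verify
    ceph auth get osd.3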

--WjW



Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Willem Jan Withagen

On 15/02/2019 11:56, Dan van der Ster wrote:

On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  wrote:


On 15/02/2019 10:39, Ilya Dryomov wrote:

On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:


Hi Marc,

You can see previous designs on the Ceph store:

https://www.proforma.com/sdscommunitystore


Hi Mike,

This site stopped working during DevConf and hasn't been working since.
I think Greg has contacted some folks about this, but it would be great
if you could follow up because it's been a couple of weeks now...


Ilya,

The site is working for me.
It just doesn't contain the Nautilus shirts (yet).


I found in the past that the http redirection for www.proforma.com
doesn't work from over here in Europe.
If someone can post the redirection target then we can access it directly.


Like:

https://proformaprostores.com/Category


at least, that is where I get directed to.

--WjW





Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Willem Jan Withagen

On 15/02/2019 10:39, Ilya Dryomov wrote:

On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:


Hi Marc,

You can see previous designs on the Ceph store:

https://www.proforma.com/sdscommunitystore


Hi Mike,

This site stopped working during DevConf and hasn't been working since.
I think Greg has contacted some folks about this, but it would be great
if you could follow up because it's been a couple of weeks now...


Ilya,

The site is working for me.
It just doesn't contain the Nautilus shirts (yet).

--WjW




Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-28 Thread Willem Jan Withagen

On 28-1-2019 02:56, Will Dennis wrote:

I mean to use CephFS on this PoC; the initial use would be to back up an
existing ZFS server with ~43TB of data (I may have to limit the backed-up data
depending on how much capacity I can get out of the OSD servers) and then share
it out via NFS as a read-only copy. That would give me some idea of the I/O
speeds on writes and reads, and allow me to test different aspects of Ceph
before I go pitching it as a primary data storage technology (it will be our
org's first foray into SDS, and I want it to succeed.)

No way I'd go primary production storage with this motley collection of 
"pre-loved" equipment :) If it all seems to work well, I think I could get a 
reasonable budget for new production-grade gear.


Perhaps superfluous, but my 2ct anyway.

I'd carefully define the term "all seems to work well".

I'm running several ZFS instances of equal or bigger size that are
specifically tuned (buses, SSDs, memory and ARC) to their usage. And
they usually do perform very well.


Now, if you define "works well" as performance close to what you get out of
your ZFS store, be careful not to compare apples to oranges. You might
need rather beefy hardware to get the Ceph cluster's performance to the same
level as your ZFS.


So you'd better define your PoC targets with realistic expectations.

--WjW






Re: [ceph-users] Scheduling deep-scrub operations

2018-12-14 Thread Willem Jan Withagen

On 14/12/2018 13:42, Alexandru Cucu wrote:

Hi,

Unfortunately there is no way of doing this from the Ceph
configuration but you could create some cron jobs to add and remove
the nodeep-scrub flag.
The only problem would be that your cluster status will show
HEALTH_WARN, but I think you could set/unset the flags per pool to
avoid this.


Don't disable the deep-scrubs, but set the interval for deep scrubbing to
INT_MAX (or a month if you'd like to play it a bit safer).

And then trigger the deep scrubs through cron.
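
A rough sketch of that approach (the interval value, schedule and script
name are just an illustration, not a tested recipe):

    # ceph.conf on the OSD nodes: push the automatic interval out to ~4 weeks
    [osd]
    osd_deep_scrub_interval = 2419200

    # cron on an admin node: weekdays, 07:00-14:xx only, one batch per hour
    # 0 7-14 * * 1-5  root  /usr/local/sbin/deep-scrub-batch.sh
    # where the script loops over a subset of OSDs and runs:
    #     ceph osd deep-scrub <osd-id>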

But as Wido usually says:
"if you can't bear the scrub load, then your cluster is undersized."

And I tend to agree, because the punishment on the cluster in case of
serious remapping can be far higher.


--WjW


On Fri, Dec 14, 2018 at 1:25 PM Caspar Smit  wrote:


Hi all,

We have operating hours from 4 pm until 7 am each weekday and 24 hour days in 
the weekend.

I was wondering if it's possible to allow deep-scrubbing from 7 am until 3 pm
only on weekdays and prevent any deep-scrubbing in the weekend.

I've seen the osd scrub begin/end hour settings but that doesn't allow for 
preventing deep-scrubs in the weekend.

Kind regards,
Caspar


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-27 Thread Willem Jan Withagen

On 26/09/2018 12:41, Eugen Block wrote:

Hi,

I'm not sure how the recovery "still works" with the flag "norecover".
Anyway, I think you should unset the flags norecover, nobackfill. Even 
if not all OSDs come back up you should allow the cluster to backfill 
PGs. Not sure, but unsetting norebalance could also be useful, but 
that can be done step by step. First watch if the cluster gets any 
better without it.


The best way to see if recovery is doing its thing is to look at the
recovering PGs in

    ceph pg dump

and check that some of the object counters are actually going down.

If they aren't, the PG is not recovering/backfilling.

I haven't found a better way to determine this (yet).
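
In practice that boils down to something like (just a sketch):

    # only the PGs that are actively recovering/backfilling
    ceph pg dump pgs_brief 2>/dev/null | egrep 'recover|backfill'

    # and/or watch the cluster-wide degraded/misplaced totals shrink
    watch -n 60 'ceph -s'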

--WjW


And can you check the plan "peetaur2" offered from IRC:
https://bpaste.net/show/20581774ff08
Also Be_El strongly offers to unset nodown parameter.


The suggested config settings look reasonable to me. You should also 
try to raise the timeouts for the MONs and increase their db cache as 
suggested earlier today.


after this point, if an osd is down, it's fine...it'll only prevent 
access to that specific data (bad for clients, fine for recovery)


I agree with that, the cluster state has to become stable first, then 
you can take a look into those OSDs that won't get up.


Regards,
Eugen


Zitat von by morphin :


Hello Eugen.  Thank you for your answer. I was losing hope of getting
an answer here.

I have faced losing 2/3 of the mons many times, but I never faced any
problem like this on Luminous.
The recovery still works and it has been 30 hours.  The last state of
my cluster is: https://paste.ubuntu.com/p/rDNHCcNG7P/
We are discussing on IRC whether we should unset the nodown and
norecover flags or not.


I tried unsetting the nodown flag yesterday and now I have 15 OSDs that do
not start anymore, with the same error --> : https://paste.ubuntu.com/p/94xpzxTSnr/
I don't know the reason for this, but I saw some commits for the
dump problem. Is this a bug or something else?

And can you check the plan "peetaur2" offered from IRC:
https://bpaste.net/show/20581774ff08
Also Be_El strongly offers to unset nodown parameter.
What do you think?
Eugen Block wrote on Wed, 26 Sep 2018 at 12:54:


Hi,

could this be related to this other Mimic upgrade thread [1]? Your
failing MONs sound a bit like the problem described there, eventually
the user reported recovery success. You could try the described steps:

  - disable cephx auth with 'auth_cluster_required = none'
  - set the mon_osd_cache_size = 20 (default 10)
  - Setting 'osd_heartbeat_interval = 30'
  - setting 'mon_lease = 75'
  - increase the rocksdb_cache_size and leveldb_cache_size on the mons
to be big enough to cache the entire db

I just copied the mentioned steps, so please read the thread before
applying anything.

Regards,
Eugen

[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030018.html 




Zitat von by morphin :

> After trying too many things with so much help on IRC, my pool
> health is still in ERROR and I think I can't recover from this.
> https://paste.ubuntu.com/p/HbsFnfkYDT/
> In the end 2 of the 3 mons crashed and started at the same time, and the
> pool is offlined. Recovery takes more than 12 hours and it is way too slow.
> Somehow recovery seems not to be working.
>
> If I can reach my data I will re-create the pool easily.
> If I run the ceph-objectstore-tool script to regenerate the mon store.db,
> can I access the RBD pool again?
> by morphin wrote on Tue, 25 Sep 2018 at 20:03:
>>
>> Hi,
>>
>> Cluster is still down :(
>>
>> Up to now we have managed to stabilize the OSDs. 118 of the 160 OSDs are
>> stable and the cluster is still in the process of settling. Thanks to
>> the guy Be-El in the ceph IRC channel. Be-El helped a lot to make the
>> flapping OSDs stable.
>>
>> What we learned up to now is that the cause of this is the sudden death
>> of 2 of the 3 monitor servers. And when they come back, if they do not
>> start one by one (each after joining the cluster) this can happen. The
>> cluster can be unhealthy and it can take countless hours to come back.
>>
>> Right now here is our status:
>> ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
>> health detail: https://paste.ubuntu.com/p/w4gccnqZjR/
>>
>> Since the OSD disks are NL-SAS it can take up to 24 hours to get an
>> online cluster. What is more, it has been said that we would be
>> extremely lucky if all the data is rescued.
>>
>> Most unhappily, our strategy is just to sit and wait :(. As soon as the
>> peering and activating count drops to 300-500 pgs we will restart the
>> stopped OSDs one by one, and for each OSD we will wait for the cluster
>> to settle down. The amount of data stored in the OSDs is 33TB. Our main
>> concern is to export our rbd pool data to a backup space. Then we will
>> start again with a clean one.
>>
>> I hope to have our analysis confirmed by an expert. Any help or advice
>> would be greatly appreciated.
>> by morphin wrote on Tue, 25 Sep 2018 at 15:08

Re: [ceph-users] mgr/dashboard: backporting Ceph Dashboard v2 to Luminous

2018-08-23 Thread Willem Jan Withagen

On 23/08/2018 12:47, Ernesto Puerta wrote:


@Willem, given your comments come from a technical ground, let's
address those technically. As you say, dashboard_v2 is already in
Mimic and will be soon in Nautilus when released, so for FreeBSD the
issue will anyhow be there. Let's look for a technical solution (both
short and long-term): shall we have meeting to handle the FreeBSD
specifics?


I have a bit more time than that...
Usually I start making a port once the release is stable, so in this
case 14.2.x. Because the FreeBSD users that like to live dangerously
can usually also build their own source from the repo.


But you're right, one day I need to bite the bullet.

ATM I'm contemplating all kinds of scenarios to make it workable for all
participants: devs, porters, users. And I haven't really found the
golden egg yet.

So yes, let's take this private and talk.

--WjW


Re: [ceph-users] mgr/dashboard: backporting Ceph Dashboard v2 to Luminous

2018-08-23 Thread Willem Jan Withagen

On 23/08/2018 11:22, Lenz Grimmer wrote:

On 08/22/2018 08:57 PM, David Turner wrote:


My initial reaction to this PR/backport was questioning why such a
major update would happen on a dot release of Luminous.  Your
reaction to keeping both dashboards viable goes to support that.
Should we really be backporting features into a dot release that
force people to change how they use the software?  That seems more of
the purpose of having new releases.


This is indeed an unusual case. But considering that the Dashboard does
not really change any of the Ceph core functionality but adds a lot of
value by improving the usability and manageability of Ceph, we agreed
with Sage on making an exception here.


I haven't really used either dashboard though.  Other than adding
admin functionality, does it remove any functionality of the previous
dashboard?


Like Kai wrote, our initial goal was to reach feature parity with
Dashboard v1, in order to not introduce a regression when replacing it.

In the meanwhile, Dashboard v2 is way beyond that and we have added a
lot of additional functionality, e.g. RBD and RGW management.

With backporting this to Luminous, we also hope to reach a larger
audience of users that have not updated to Mimic yet.


A weird approach question:

Is there still some "logical" separation between the v1 and v2 stuff?
And is it possible to build what used to be v1 without importing the
npm stuff? That way I can "easily" package and distribute that piece.


ATM this plan really cripples my options for packaging any of the
dashboard... whereas L and M did have something workable.


Just keeping it in the tree as "dead wood" for the time being would help
me give people a taste of what is there. Just ripping it out does not
help. And I'll try to work out a solution before the release of
Nautilus, or accept that it will be gone and has to be v2.


--WjW


Re: [ceph-users] mgr/dashboard: backporting Ceph Dashboard v2 to Luminous

2018-08-23 Thread Willem Jan Withagen

On 22/08/2018 19:42, Ernesto Puerta wrote:

Thanks for your feedback, Willem!


The old dashboard does not need any package fetch while
building/installing. Something that is not very handy when building
FreeBSD packages. And I haven't gotten around to determining how to get
around that.


I thought that https://github.com/ceph/ceph/pull/22562 fixed make-dist
issues on FreeBSD. Is that not working yet? Let us know if that's the
case!


Eh, yes and no.
The PR allows me to build the dashboard in tree. All the guarding with
nodeenv and such is not there, so that is already less clean than it
should be, but it works for testing.


Now, building a FreeBSD package is a totally different cookie.
That does not work with the way it is now designed.
FreeBSD package building does not allow data to be fetched halfway through
the build process! So calling something like npm during the build is taboo.


There is a fetch stage, where all the sources and submodules are fetched,
and I have 2 make targets, pre_fetch and post_fetch, where I can get
extra sources that are required.

The problem is also that the sources are only unpacked AFTER post_fetch, so
even running part of the CMake code in post_fetch does not work,
because the tree is not there.


And it is hard to track what npm is installing, because it seems to
pull in a complete forest of new dependencies.


So my way out at the moment is probably going to be:
 - build the tree offline,
 - take all the resulting dashboard source,
   including what gets installed in /usr/local, :(
 - put it in a blob,
 - put the blob in a FreeBSD port where I fetch the blob,
 - make a package from that,
and then we have net/ceph-dashboard. :(
Did I mention that I need to do this for 2-3 releases? 8-|


Suggest renaming it to simpledash or dashboard_v1 and keep it in the
tree.


Unfortunately, keeping v1 is not as simple as moving the dashboard to
a separate directory (unless we leave it hanging as dead code).
Dashboard_v2 completely replaces dashboard_v1, and that also means
unit test, QA suites, and references in common files (install-deps.sh,
CMakeLists.txt, ceph.spec.in, debian/*, do_freebsd.sh, vstart.sh,
etc.).

My concern is that properly keeping both ones would go beyond a
long-but-mostly-clean cherry-picking. It'd involve Luminous actively
diverging from master, which might burden other backports with
manual/creative conflict-solving.


Yes, I understand the concerns. But mine are no less complicated.
And especially if all the parts of v1 are also in v2, it requires "just"
a smart division of tests, among other things, so that v1 can be tested
as a subset.

I know, easier said than done.

--WjW


KR,

Ernesto

On Wed, Aug 22, 2018 at 12:43 PM Willem Jan Withagen  wrote:


On 22/08/2018 12:16, Ernesto Puerta wrote:

[sent both to ceph-devel and ceph-users lists, as it might be of
interest for both audiences]

Hi all,

This e-mail is just to announce the WIP on backporting dashboard_v2
(http://docs.ceph.com/docs/master/mgr/dashboard/) from master to
Luminous release.

The ultimate goal for this backport is to replace dashboard_v1 in
Luminous and provide, **as much as possible**, (see note below) a
level of functionality on a par with master's (i. e.: RBD and RGW
management, HTTPS support, User Management, Role Based Access Control,
Grafana integration, SSO, etc.).


If done so, I would prefer to also keep the old "simple" Dashboard.
Reason for that is the ease of portability.

The old dashboard does not need any package fetch while
building/installing. Something that is not very handy when building
FreeBSD packages. And I haven't gotten around to determining how to get
around that.

Next to that: that dashboard is "simple". Something I really like, but
that is perhaps personal.

Suggest renaming it to simpledash or dashboard_v1 and keep it in the tree.

Thanx,
--WjW





Re: [ceph-users] mgr/dashboard: backporting Ceph Dashboard v2 to Luminous

2018-08-22 Thread Willem Jan Withagen

On 22/08/2018 12:16, Ernesto Puerta wrote:

[sent both to ceph-devel and ceph-users lists, as it might be of
interest for both audiences]

Hi all,

This e-mail is just to announce the WIP on backporting dashboard_v2
(http://docs.ceph.com/docs/master/mgr/dashboard/) from master to
Luminous release.

The ultimate goal for this backport is to replace dashboard_v1 in
Luminous and provide, **as much as possible**, (see note below) a
level of functionality on a par with master's (i. e.: RBD and RGW
management, HTTPS support, User Management, Role Based Access Control,
Grafana integration, SSO, etc.).


If done so, I would prefer to also keep the old "simple" Dashboard.
Reason for that is the ease of portability.

The old dashboard does not need any package fetch while 
building/installing. Something that is not very handy when building 
FreeBSD packages. And I haven't gotten around to determining how to get 
around that.


Next to that: that dashboard is "simple". Something I really like, but 
that is perhaps personal.


Suggest renaming it to simpledash or dashboard_v1 and keep it in the tree.

Thanx,
--WjW



Re: [ceph-users] ceph configuration; Was: FreeBSD rc.d script: sta.rt not found

2018-08-21 Thread Willem Jan Withagen

Norman,

I'm cc-ing this back to ceph-users so others can reply to it, or find it
in the future.


On 21/08/2018 12:01, Norman Gray wrote:


Willem Jan, hello.

Thanks for your detailed notes on my list question.

On 20 Aug 2018, at 21:32, Willem Jan Withagen wrote:


 # zpool create -m/var/lib/ceph/osd/osd.0 osd.0 gpt/zd000 gpt/zd001


Over the weekend I updated the FreeBSD section of the Ceph manual with
exactly that.
I'm not sure what sort of devices zd000 and zd001 are, but concatenating
devices seriously lowers the MTBF of the vdev. As such, it is
likely better to create 2 OSDs on these 2 devices.


My sort-of problem is that the machine I'm doing this on was not specced 
with Ceph in mind: it has 16 3.5TB disks.  Given that 
<http://docs.ceph.com/docs/master/start/hardware-recommendations/> 
suggests that 20 is a 'high' number of OSDs on a host, I thought it 
might be better to aim for an initial setup of 6 two-disk OSDs rather 
than 12 one-disk ones (leaving four disks free).


That said, 12 < 20, so I think that, especially bearing in mind your 
advice here, I should probably stick to 1-disk OSDs with one (default) 
5GB SSD journal each, and not complicate things.


Only one way to find out: try both...
But I certainly do not advise putting concatenated disks in an OSD, especially
not for production. Break one disk and you break the vdev.


And the most important thing for OSDs is 1 GB of RAM per 1 TB of disk.
So with 70 TB of disk you'd need 64 GB of RAM or more, preferably more, since
ZFS will want its share as well.
CPU is not going to be that much of an issue, unless you have really
tiny CPUs.


What I still have not figured out is what to do with the SSDs.
There are 3 things you can do (or any combination):
1) Ceph standard: make it a journal. Mount the SSD on a separate dir and
   get ceph-disk to start using it as a journal.
2) Attach a ZFS cache on SSD to the vdev, which will improve reads.
3) Attach a ZFS log on SSD to the vdev, which will improve sync writes.

At the moment I'm doing all three:
[~] w...@freetest.digiware.nl> zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
osd.0.journal  316K  5.33G    88K  /usr/jails/ceph_0/var/lib/ceph/osd/osd.0/journal-ssd
osd.1.journal  316K  5.33G    88K  /usr/jails/ceph_1/var/lib/ceph/osd/osd.1/journal-ssd
osd.2.journal  316K  5.33G    88K  /usr/jails/ceph_2/var/lib/ceph/osd/osd.2/journal-ssd
osd.3.journal  316K  5.33G    88K  /usr/jails/ceph_3/var/lib/ceph/osd/osd.3/journal-ssd
osd.4.journal  316K  5.33G    88K  /usr/jails/ceph_4/var/lib/ceph/osd/osd.4/journal-ssd
osd.5.journal  316K  5.33G    88K  /usr/jails/ceph_0/var/lib/ceph/osd/osd.5/journal-ssd
osd.6.journal  316K  5.33G    88K  /usr/jails/ceph_1/var/lib/ceph/osd/osd.6/journal-ssd
osd.7.journal  316K  5.33G    88K  /usr/jails/ceph_2/var/lib/ceph/osd/osd.7/journal-ssd
osd_0         5.16G   220G  5.16G  /usr/jails/ceph_0/var/lib/ceph/osd/osd.0
osd_1         5.34G   219G  5.34G  /usr/jails/ceph_1/var/lib/ceph/osd/osd.1
osd_2         5.42G   219G  5.42G  /usr/jails/ceph_2/var/lib/ceph/osd/osd.2
osd_3         6.62G  1.31T  6.62G  /usr/jails/ceph_3/var/lib/ceph/osd/osd.3
osd_4         6.83G  1.75T  6.83G  /usr/jails/ceph_4/var/lib/ceph/osd/osd.4
osd_5         5.92G  1.31T  5.92G  /usr/jails/ceph_0/var/lib/ceph/osd/osd.5
osd_6         6.00G  1.31T  6.00G  /usr/jails/ceph_1/var/lib/ceph/osd/osd.6
osd_7         6.10G  1.31T  6.10G  /usr/jails/ceph_2/var/lib/ceph/osd/osd.7


[~] w...@freetest.digiware.nl> zpool list -v osd_1
NAME               SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
osd_1              232G  5.34G   227G        -         -    0%   2%  1.00x  ONLINE  -
  gpt/osd_1        232G  5.34G   227G        -         -    0%   2%
log                   -      -      -        -         -     -    -
  gpt/osd.1.log    960M    12K   960M        -         -    0%   0%
cache                 -      -      -        -         -     -    -
  gpt/osd.1.cache  22.0G  1.01G  21.0G        -         -    0%   4%

So each OSD has an SSD journal (a zfs volume) and each OSD pool has a cache
and a log attached. ATM the cluster is idle, hence the log is "empty".
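
For reference, log and cache vdevs like the ones above are attached with the
standard ZFS commands, roughly (the labels follow the naming shown above):

    zpool add osd_1 log gpt/osd.1.log      # SSD-backed log for sync writes
    zpool add osd_1 cache gpt/osd.1.cache  # SSD read cache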


But I would first work on the architecture of how you want the cluster
to be, and then start tuning. ZFS log and cache are easily added and
removed after the fact.


I found what appear to be a couple of typos in your script which I can 
report back to you.  I hope to make significant progress with this work 
this week, so should be able to give you more feedback on the script, on 
my experiences, and on the FreeBSD page in the manual.


Sure, keep'm coming

--WjW


I'll work through your various notes.  Below are a couple of specific 
points.



When I attempt to start the service, I get:

# service ceph start
=== mon.pochhammer ===


You're sort of free to pick names, but most of the time

Re: [ceph-users] FreeBSD rc.d script: sta.rt not found

2018-08-16 Thread Willem Jan Withagen

On 16/08/2018 11:01, Willem Jan Withagen wrote:

Hi Norman,

Thanx for trying the Ceph port.
As you will find out it is still rough around the edges...
But please feel free to ask questions (on the ceph-user list)

I will try to help answer them as well as I can.
Also feel free to send me as much feedback as you can, to improve
either the code and/or the documentation.

One of the fixes you will run into is that it was suggested to create an 
osd zpool volume with:

  zpool create -o mountpoint=/var/lib/ceph/osd/osd.1 osd
And that is in some older man-pages.

This will not work, it needs to be:
 gpart create -s GPT ada1
 gpart add -t freebsd-zfs -l osd1 ada1
 zpool create zpool gpt/osd1
 zfs create -o mountpoint=/var/lib/ceph/osd/osd.1 osd.1


In the meantime I have uploaded a PR to fix this in the manual, which 
should read:

  gpart create -s GPT ada1
  gpart add -t freebsd-zfs -l osd.1 ada1
  zpool create osd.1 gpt/osd.1
  zfs create -o mountpoint=/var/lib/ceph/osd/osd.1 osd.1

--WjW



Re: [ceph-users] FreeBSD rc.d script: sta.rt not found

2018-08-16 Thread Willem Jan Withagen

Hi Norman,

Thanx for trying the Ceph port.
As you will find out it is still rough around the edges...
But please feel free to ask questions (on the ceph-user list)

I will try to help answer them as well as I can.
Also feel free to send me as much feedback as you can, to improve
either the code and/or the documentation.

One of the fixes you will run into is that it was suggested to create an 
osd zpool volume with:

 zpool create -o mountpoint=/var/lib/ceph/osd/osd.1 osd
And that is in some older man-pages.

This will not work, it needs to be:
gpart create -s GPT ada1
gpart add -t freebsd-zfs -l osd1 ada1
zpool create zpool gpt/osd1
zfs create -o mountpoint=/var/lib/ceph/osd/osd.1 osd.1

--WjW


On 16/08/2018 00:45, Willem Jan Withagen wrote:

On 15/08/2018 19:46, Norman Gray wrote:


Greetings.

I'm having difficulty starting up the ceph monitor on FreeBSD.  The
rc.d/ceph script appears to be doing something ... odd.

I'm following the instructions on
<http://docs.ceph.com/docs/master/install/manual-freebsd-deployment/>.
I've configured a monitor called mon.pochhammer

When I try to start the service with

 # service ceph start

I get an error

 /usr/local/bin/init-ceph: sta.rt not found
(/usr/local/etc/ceph/ceph.conf defines mon.pochhammer, /var/lib/ceph
defines )

This appears to be because, in ceph_common.sh's get_name_list(), $orig
is 'start' and allconf ends up as ' mon.pochhammer'.  In that function,
the value of $orig is then worked through word-by-word, whereupon
'start' is split into 'sta' and 'rt', which fails to match a test a few
lines later.

Calling 'service ceph start' results in /usr/local/bin/ceph-init being
called with arguments 'start start', and calling 'service ceph start
start mon.pochhammer' (as the above instructions recommend) results in
'run_rc_command start start start mon.pochhammer'.  Is the ceph-init
script perhaps missing a 'shift' at some point before the sourcing of
ceph_common.sh?

Incidentally, that's a rather unexpected call to the rc.d script -- I
would have expected just 'service ceph start' as above.  The latter call
does seem to extract the correct mon.pochhammer monitor name from the
correct config file, even if the presence of the word 'start' does then
confuse it.

This is FreeBSD 11.2, and ceph-conf version 12.2.7, built from the
FreeBSD ports tree.


This is an error in the /usr/local/etc/rc.d/ceph file.

The last line should look like:
 run_rc_command "$1"

The double set of commands is confusing init-ceph.

Init-ceph or rc.d/ceph should be rewritten, but I just have not yet
gotten to that. Also because in the near future ceph-disk goes away and the
config starts looking different/less important. And I have not yet
decided how to fit the parts together.


--WjW




Re: [ceph-users] FreeBSD rc.d script: sta.rt not found

2018-08-15 Thread Willem Jan Withagen

On 15/08/2018 19:46, Norman Gray wrote:


Greetings.

I'm having difficulty starting up the ceph monitor on FreeBSD.  The
rc.d/ceph script appears to be doing something ... odd.

I'm following the instructions on
<http://docs.ceph.com/docs/master/install/manual-freebsd-deployment/>.
I've configured a monitor called mon.pochhammer

When I try to start the service with

     # service ceph start

I get an error

     /usr/local/bin/init-ceph: sta.rt not found
(/usr/local/etc/ceph/ceph.conf defines mon.pochhammer, /var/lib/ceph
defines )

This appears to be because, in ceph_common.sh's get_name_list(), $orig
is 'start' and allconf ends up as ' mon.pochhammer'.  In that function,
the value of $orig is then worked through word-by-word, whereupon
'start' is split into 'sta' and 'rt', which fails to match a test a few
lines later.

Calling 'service ceph start' results in /usr/local/bin/ceph-init being
called with arguments 'start start', and calling 'service ceph start
start mon.pochhammer' (as the above instructions recommend) results in
'run_rc_command start start start mon.pochhammer'.  Is the ceph-init
script perhaps missing a 'shift' at some point before the sourcing of
ceph_common.sh?

Incidentally, that's a rather unexpected call to the rc.d script -- I
would have expected just 'service ceph start' as above.  The latter call
does seem to extract the correct mon.pochhammer monitor name from the
correct config file, even if the presence of the word 'start' does then
confuse it.

This is FreeBSD 11.2, and ceph-conf version 12.2.7, built from the
FreeBSD ports tree.


This is an error in the /usr/local/etc/rc.d/ceph file.

The last line should look like:
run_rc_command "$1"

The double set of commands is confusing init-ceph.
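
For completeness, a minimal sketch of how the tail of /usr/local/etc/rc.d/ceph
would then look, following the usual rc.subr conventions (the earlier parts of
the script are assumed unchanged):

    name="ceph"
    rcvar="${name}_enable"

    load_rc_config ${name}
    run_rc_command "$1"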

Init-ceph or rc.d/ceph should be rewritten, but I just have not yet
gotten to that. Also because in the near future ceph-disk goes away and the
config starts looking different/less important. And I have not yet
decided how to fit the parts together.


--WjW




Re: [ceph-users] Make a ceph options persist

2018-08-13 Thread Willem Jan Withagen

On 13/08/2018 10:51, John Spray wrote:

On Fri, Aug 10, 2018 at 10:40 AM Willem Jan Withagen  wrote:


Hi,

The manual of dashboard suggests:
 ceph config-key set mgr/dashboard/server_addr ${MGR_IP}

But that command is required after reboot.


config-key settings are persistent.  The docs are probably just
telling you to restart the mgr daemon after setting it?
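
For reference, a minimal sketch of that flow (the address is a placeholder,
and the restart mechanism differs per platform):

    ceph config-key set mgr/dashboard/server_addr 192.0.2.10
    ceph config-key get mgr/dashboard/server_addr    # survives reboots
    # then restart the active mgr so the dashboard rebinds, e.g. on Linux:
    # systemctl restart ceph-mgr@$(hostname -s)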


Ah, OK, right.

I guess I misread the wording. I will reread it to see where I went wrong.

Thanx,
--WjW




[ceph-users] Make a ceph options persist

2018-08-10 Thread Willem Jan Withagen

Hi,

The manual of dashboard suggests:
ceph config-key set mgr/dashboard/server_addr ${MGR_IP}

But that command is required after reboot.

I have tried all kinds of versions, but was not able to get it working...

How do I turn this into a permanent setting in /etc/ceph/ceph.conf?

--WjW


Re: [ceph-users] Why lvm is recommended method for bleustore

2018-07-23 Thread Willem Jan Withagen

On 22-7-2018 15:51, Satish Patel wrote:

I read that post and that's why I opened this thread, for a few more questions
and clarification.

When you say the OSD doesn't come up, what does that actually mean?  After a
reboot of the node, after a service restart, or after installing a new disk?

You said you are using a manual method; what is that?

I'm building a new cluster and have zero prior experience, so how can I
reproduce this error to see that LVM really is a life-saving tool here? I'm
sure there are plenty of people using it, but I didn't find any good
documentation except that mailing list thread, which raised more questions
in my mind.


Satish

It is a choice made during the design of the new setup with ceph-volume,
for reasons set out by Sage in one of the referred posts.

It is just one of many engineering questions that get solved by
selecting a tool that does the work, in this case LVM.
And I do not think using it was given a huge amount of consideration.
If I had to guess, the possibility to attach attributes directly to volumes
is going to be one of the selectors.
(I'm not even sure there is an alternative low-impact middle layer that
can do disk abstraction on Linux.)


LVM is sort of the first tool of the trade if you do not want to deal with
raw disks...
And as Marc said: you would need to start a full study of the possible
alternatives to answer the questions you raise.


I personally would not waste the time on that. Ceph-volume has been gone
over in a few posts on the developers list, and the discussion was
rarely about the selection of LVM.


--WjW


Sent from my iPhone


On Jul 22, 2018, at 6:31 AM, Marc Roos  wrote:



I don’t think it will get any more basic than that. Or maybe this? If
the doctor diagnoses you, you can either accept this, get 2nd opinion,
or study medicine to verify it.

In short, lvm has been introduced to solve some issues related to
starting osd's (which I did not have, probably because of a 'manual'
configuration). And it opens the ability to support more (future)
devices.

I gave you two links, did you read the whole thread?
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47802.html





-Original Message-
From: Satish Patel [mailto:satish@gmail.com]
Sent: zaterdag 21 juli 2018 20:59
To: ceph-users
Subject: [ceph-users] Why lvm is recommended method for bleustore

Folks,

I think I am going to boil the ocean here. I googled a lot about why lvm
is the recommended method for bluestore, but didn't find any good and
detailed explanation, not even on the official Ceph website.

Can someone explain it here in basic language? I am in no way an expert and
just want to understand what the advantage is of adding an extra layer of
complexity.

I found this post, but I got lost reading it and want to see what
other folks are suggesting and offering, in their own words:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg46768.html

~S


Re: [ceph-users] JBOD question

2018-07-21 Thread Willem Jan Withagen

On 21/07/2018 01:45, Oliver Freyermuth wrote:

Hi Satish,

that really completely depends on your controller.



This is what I get on an older AMCC 9550 controller.
Note that the disk type is set to JBOD. But the disk descriptors are 
hidden. And you'll never know what more is not done right.
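
(For reference, listings like the ones below come from the standard FreeBSD
tooling, e.g.:

    geom disk list da6       # the "Geom name" output shown below
    camcontrol devlist       # what the HBA actually reports per bus/target
)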


Geom name: da6
Providers:
1. Name: da6
   Mediasize: 1000204886016 (932G)
   Sectorsize: 512
   Mode: r1w1e2
   descr: AMCC 9550SXU-8L DISK
   lunname: AMCCZ1N00KBD
   lunid: AMCCZ1N00KBD
   ident: Z1N00KBD
   rotationrate: unknown
   fwsectors: 63
   fwheads: 255

This is an LSI 9802 controller in IT mode:
(And that gives me a bit more faith)
Geom name: da7
Providers:
1. Name: da7
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Mode: r1w1e1
   descr: WDC WD30EFRX-68AX9N0
   lunid: 0004d927f870
   ident: WD-WMC1T4088693
   rotationrate: unknown
   fwsectors: 63
   fwheads: 255

--WjW


Re: [ceph-users] RAID question for Ceph

2018-07-19 Thread Willem Jan Withagen

On 19/07/2018 13:28, Satish Patel wrote:

Thanks for massive details, so what are the options I have can I disable raid 
controller and run system without raid and use software raid for OS?


Not sure what kind of RAID controller you have. I seem to recall an HP
thingy? Those I don't trust at all in HBA mode; I've heard too many
bad things about them: they keep messing with the communication to the disk.

Also not sure you can get a firmware version that does HBA only.


Does that make sense ?


Well, I run ZFS on FreeBSD, and usually run a ZFS mirror for my OS disks.
I guess that for the OS partition it does not really matter what you do.
Even RAIDing it on the controller is not that important; Linux will be able
to manage that. And your OS disks are not going to be > 4-6T, so relatively
OK recovery times and no serious performance requirements.


So you could do either.

Normally, for a ZFS/Ceph system, we would:
 - have 2 small disks mirrored for the OS. Nowadays you can get a 64-128G
   SATA DOM for this, which saves 2 trays in the front. Or get a cabinet
   with 2 2.5"s in the back, connected to the motherboard.
 - for a 24-tray cabinet:
     a disk-tray backplane with individual lanes to each disk
     (you have to specifically ask SM for that),
     a motherboard with at least 3x 8 PCIe lanes and 2x 10G onboard
     (I would prefer 3x 16, but those are relatively rare or not fully
     wired, and they require CPUs with 48 lanes),
     3 LSI 9207-8i HBAs to connect the trays.

--WjW


Sent from my iPhone


On Jul 19, 2018, at 6:33 AM, Willem Jan Withagen  wrote:


On 19/07/2018 10:53, Simon Ironside wrote:

On 19/07/18 07:59, Dietmar Rieder wrote:
We have P840ar controllers with battery backed cache in our OSD nodes
and configured an individual RAID-0 for each OSD (ceph luminous +
bluestore). We have not seen any problems with this setup so far and
performance is great at least for our workload.

I'm doing the same with LSI RAID controllers for the same reason, to take 
advantage of the battery backed cache. No problems with this here either. As 
Troy said, you do need to go through the additional step of creating a single 
disk RAID0 whenever you replace a disk that you wouldn't with regular HBA.


This discussion has been running on the ZFS lists for quite some time, and at length...
ZFS really depends on the software having direct access to the
disk, without extra abstraction layers.
And with both ZFS and Ceph, RAID is dead: these newly designed storage
systems solve problems that RAID cannot solve anymore.
(Read up on why newer RAID versions will not really save you from a crashed disk,
due to an MTBF that equals the recovery time on new, large disks.)

The basic fact remains that RAID controllers sort of lie to their users, and the
advanced ones with backup batteries even more so. If everything is well in
paradise you will usually get away with it. But if not, that expensive piece of
hardware will turn everything into cr..p.

For example, lots of LSI firmware has had bugs in it; especially the
Enterprise version can do really weird things. That is why we install the IT
version of the firmware, to cripple the RAID functionality as much as one
can. It turns your expensive RAID controller into basically just a plain HBA
(no more configs for extra disks).

So unless you HAVE to take it, because you cannot rule it out in the system
configurator while buying, go for the simple controllers that can act as an HBA.

There are a few more things to consider, like:
 - What is the bandwidth on the disk carrier backplane?
   What kind of port multipliers are used, and is the design as
   it should be? I've seen boards with 2 multipliers where it turns
   out that only one is used, and the other can only be used for
   multipath... So is that going to be a bottleneck on the feed
   to the multiplier?
 - How many lanes of your expensive multi-lane SAS/SATA HBA are
   actually used?
   I have seen 24-tray backplanes that want to run over 2 or 4 SAS
   lanes, even when you think you are using all 8 lanes from the
   HBA because you have 2 SFF-8087 cables.
   It is not without reason that SuperMicro also has a disk-tray
   backplane with 24 individually wired-out SAS/SATA ports.
   Just ordering the basic cabinet will probably get you the wrong
   stuff.
 - And once you have sort of fixed the bottlenecks here, can you actually
   run all disks at full speed over the controller to the PCI
   bus(ses)?
   Even a 16 lane PCIe slot will at very theoretical best do 16Gbit/s.
   Now connect a bunch of 12Gb/s SSDs to this connector and see
   the bottleneck arise. Even with more than 20 HDDs it is going to be
   crowded on this controller.

Normally I'd say: Lies, damned lies, and statistics.
But in this case: Lies, damned lies and hardware. 8-D

--WjW

Re: [ceph-users] RAID question for Ceph

2018-07-19 Thread Willem Jan Withagen

On 19/07/2018 10:53, Simon Ironside wrote:

On 19/07/18 07:59, Dietmar Rieder wrote:


We have P840ar controllers with battery backed cache in our OSD nodes
and configured an individual RAID-0 for each OSD (ceph luminous +
bluestore). We have not seen any problems with this setup so far and
performance is great at least for our workload.


I'm doing the same with LSI RAID controllers for the same reason, to 
take advantage of the battery backed cache. No problems with this here 
either. As Troy said, you do need to go through the additional step of 
creating a single disk RAID0 whenever you replace a disk that you 
wouldn't with regular HBA.


This discussion has been running on the ZFS lists for quite some time, and
at length...
ZFS really depends on the software having direct access to
the disk, without extra abstraction layers.
And with both ZFS and Ceph, RAID is dead: these newly designed
storage systems solve problems that RAID cannot solve anymore.
(Read up on why newer RAID versions will not really save you from a crashed
disk, due to an MTBF that equals the recovery time on new, large disks.)


The basic fact remains that RAID controllers sort of lie to their users, and
the advanced ones with backup batteries even more so. If everything is
well in paradise you will usually get away with it. But if not, that
expensive piece of hardware will turn everything into cr..p.


For example, lots of LSI firmware has had bugs in it; especially the
Enterprise version can do really weird things. That is why we install
the IT version of the firmware, to cripple the RAID functionality as
much as one can. It turns your expensive RAID controller into basically
just a plain HBA (no more configs for extra disks).


So unless you HAVE to take it, because you cannot rule it out in the
system configurator while buying, go for the simple controllers that
can act as an HBA.


There are a few more things to consider, like:
 - What is the bandwidth on the disk carrier backplane?
   What kind of port multipliers are used, and is the design as
   it should be? I've seen boards with 2 multipliers where it turns
   out that only one is used, and the other can only be used for
   multipath... So is that going to be a bottleneck on the feed
   to the multiplier?
 - How many lanes of your expensive multi-lane SAS/SATA HBA are
   actually used?
   I have seen 24-tray backplanes that want to run over 2 or 4 SAS
   lanes, even when you think you are using all 8 lanes from the
   HBA because you have 2 SFF-8087 cables.
   It is not without reason that SuperMicro also has a disk-tray
   backplane with 24 individually wired-out SAS/SATA ports.
   Just ordering the basic cabinet will probably get you the wrong
   stuff.
 - And once you have sort of fixed the bottlenecks here, can you actually
   run all disks at full speed over the controller to the PCI
   bus(ses)?
   Even a 16 lane PCIe slot will at very theoretical best do 16Gbit/s.
   Now connect a bunch of 12Gb/s SSDs to this connector and see
   the bottleneck arise. Even with more than 20 HDDs it is going to be
   crowded on this controller.

Normally I'd say: Lies, damned lies, and statistics.
But in this case: Lies, damned lies and hardware. 8-D

--WjW


Re: [ceph-users] CentOS Dojo at CERN

2018-06-22 Thread Willem Jan Withagen

On 21-6-2018 14:44, Dan van der Ster wrote:

On Thu, Jun 21, 2018 at 2:41 PM Kai Wagner  wrote:


On 20.06.2018 17:39, Dan van der Ster wrote:

And BTW, if you can't make it to this event we're in the early days of
planning a dedicated Ceph + OpenStack Days at CERN around May/June
2019.
More news on that later...

Will that be during a CERN maintenance window?

*that would raise my interest dramatically :-)*


Yes, 2019 is a during the long shutdown 2:
https://lhc-commissioning.web.cern.ch/lhc-commissioning/schedule/LHC-long-term.htm

Part of the "more news on that later" is that we're trying to
understand if we'll be be able to go underground...


+1

--WjW




Re: [ceph-users] ceph-disk is getting removed from master

2018-05-23 Thread Willem Jan Withagen
On 23-5-2018 17:12, Alfredo Deza wrote:
> Now that Mimic is fully branched out from master, ceph-disk is going
> to be removed from master so that it is no longer available for the N
> release (pull request to follow)

> Willem, we don't have a way of directly supporting FreeBSD, I've
> suggested that a plugin would be a good way to consume ceph-volume
> with whatever FreeBSD needs, alternatively forking ceph-disk could be
> another option?

Yup, I'm aware of my "trouble"/commitment.

Now that you have ripped out most/all of the partitioning stuff, there
should not be much that one would need to do in ceph-volume other than
accept the filestore directories to format the MON/OSD stuff in.

IFF I could find the time to dive into ceph-volume. :(
ATM I'm having a hard time keeping up with the changes as it is.

I'd appreciate it if you could delay yanking ceph-disk until we are close
to the Nautilus release. At that point, feel free to use the axe.

--WjW



Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Willem Jan Withagen
On 21-3-2018 13:47, Paul Emmerich wrote:
> Hi,
> 
> 2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+
> DAC connections should be faster: switches are typically in the range of
> ~500ns to 1µs.
> 
> 
> But you'll find that this small difference in latency induced by the
> switch will be quite irrelevant in the grand scheme of things when using
> the Linux network stack...

But I think it does matter when people start to worry about selecting
high-clock-speed CPUs versus packages with more cores...

900ns is quite a lot if you have that mindset.
And probably 1800ns at that, because the delay will be at both ends.
Or perhaps even 3600ns, because the delay is added at every Ethernet
connector???

But I'm inclined to believe you that the network stack could take quite
some time...
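
To put those numbers in perspective, a quick back-of-the-envelope
serialization-time check (plain shell and bc, purely illustrative):

    # wire time of a full 1518-byte frame, in nanoseconds
    echo "scale=1; 1518*8/10" | bc    # ~1214.4 ns at 10 Gbit/s
    echo "scale=1; 1518*8/1" | bc     # ~12144.0 ns at 1 Gbit/s

So the extra ~0.9 µs per 10GBASE-T hop is in the same ballpark as the wire
time of a full-size frame at 10 Gbps.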


--WjW


> Paul
> 
> 2018-03-21 12:16 GMT+01:00 Willem Jan Withagen <w...@digiware.nl>:
> 
> Hi,
> 
> I just ran into this table for a 10G Netgear switch we use:
> 
> Fiber delays:
> 10 Gbps fiber delay (64-byte packets): 1.827 µs
> 10 Gbps fiber delay (512-byte packets): 1.919 µs
> 10 Gbps fiber delay (1024-byte packets): 1.971 µs
> 10 Gbps fiber delay (1518-byte packets): 1.905 µs
> 
> Copper delays:
> 10 Gbps copper delay (64-byte packets): 2.728 µs
> 10 Gbps copper delay (512-byte packets): 2.85 µs
> 10 Gbps copper delay (1024-byte packets): 2.904 µs
> 10 Gbps copper delay (1518-byte packets): 2.841 µs
> 
> Fiber delays:
> 1 Gbps fiber delay (64-byte packets): 2.289 µs
> 1 Gbps fiber delay (512-byte packets): 2.393 µs
> 1 Gbps fiber delay (1024-byte packets): 2.423 µs
> 1 Gbps fiber delay (1518-byte packets): 2.379 µs
> 
> Copper delays:
> 1 Gbps copper delay (64-byte packets): 2.707 µs
> 1 Gbps copper delay (512-byte packets): 2.821 µs
> 1 Gbps copper delay (1024-byte packets): 2.866 µs
> 1 Gbps copper delay (1518-byte packets): 2.826 µs
> 
> So the difference is serious: 900ns on a total of 1900ns for a 10G
> packet.
> Another strange thing is that 1K packets are slower than 1518 bytes.
> 
> So that might warrant connecting boxes with optics rather than CAT
> cabling if you are trying to squeeze the maximum out of a setup.
> 
> The sad thing is that they do not report numbers for jumbo frames, and
> doing these measurements yourself is not easy...
> 
> --WjW
> 
> 
> 
> 
> 
> -- 
> -- 
> Paul Emmerich
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io <http://www.croit.io>
> Tel: +49 89 1896585 90



[ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Willem Jan Withagen
Hi,

I just ran into this table for a 10G Netgear switch we use:

Fiber delays:
10 Gbps fiber delay (64-byte packets): 1.827 µs
10 Gbps fiber delay (512-byte packets): 1.919 µs
10 Gbps fiber delay (1024-byte packets): 1.971 µs
10 Gbps fiber delay (1518-byte packets): 1.905 µs

Copper delays:
10 Gbps copper delay (64-byte packets): 2.728 µs
10 Gbps copper delay (512-byte packets): 2.85 µs
10 Gbps copper delay (1024-byte packets): 2.904 µs
10 Gbps copper delay (1518-byte packets): 2.841 µs

Fiber delays:
1 Gbps fiber delay (64-byte packets): 2.289 µs
1 Gbps fiber delay (512-byte packets): 2.393 µs
1 Gbps fiber delay (1024-byte packets): 2.423 µs
1 Gbps fiber delay (1518-byte packets): 2.379 µs

Copper delays:
1 Gbps copper delay (64-byte packets): 2.707 µs
1 Gbps copper delay (512-byte packets): 2.821 µs
1 Gbps copper delay (1024-byte packets): 2.866 µs
1 Gbps copper delay (1518-byte packets): 2.826 µs

So the difference is serious: 900ns on a total of 1900ns for a 10G packet.
Another strange thing is that 1K packets are slower than 1518 bytes.

So that might warrant connecting boxes with optics rather than CAT cabling
if you are trying to squeeze the maximum out of a setup.

The sad thing is that they do not report numbers for jumbo frames, and doing
these measurements yourself is not easy...

--WjW



Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-03-03 Thread Willem Jan Withagen

On 23/02/2018 14:27, Caspar Smit wrote:

Hi All,

What would be the proper way to preventively replace a DB/WAL SSD (when
it is nearing its DWPD/TBW limit but has not failed yet)?


It hosts DB partitions for 5 OSD's

Maybe something like:

1) ceph osd reweight 0 the 5 OSD's
2) let backfilling complete
3) destroy/remove the 5 OSD's
4) replace SSD
5) create 5 new OSD's with seperate DB partition on new SSD

When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved so 
i thought maybe the following would work:


1) ceph osd set noout
2) stop the 5 OSD's (systemctl stop)
3) 'dd' the old SSD to a new SSD of same or bigger size
4) remove the old SSD
5) start the 5 OSD's (systemctl start)
6) let backfilling/recovery complete (only delta data between OSD stop 
and now)

6) ceph osd unset noout

Would this be a viable method to replace a DB SSD? Any udev/serial 
nr/uuid stuff preventing this to work?


What I would do under FreeBSD/ZFS (and perhaps there is something under 
Linux that works the same):


Promote the disk/zvol for the DB/WAL to a mirror.
  This is instantaneous, and does not modify anything.
Add the new SSD to the mirror, and wait until the new SSD has resilvered.
Then I'd remove the old SSD from the mirror.

You'd be left with a one-disk mirror for the DB/WAL, but that
does not cost much. ZFS does not even think it is wrong, if you
removed the disk in the correct way.


And no reboot required.
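
A minimal sketch of that ZFS flow (the pool and GPT labels are made up; it
assumes the DB/WAL device is a pool of its own):

    zpool attach db-wal-pool gpt/old-db-ssd gpt/new-db-ssd  # single vdev becomes a mirror
    zpool status db-wal-pool                                # wait for the resilver to finish
    zpool detach db-wal-pool gpt/old-db-ssd                 # drop the worn-out SSD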

No idea if you can do something similar under LVM or other types of 
mirroring stuff.


--WjW





Re: [ceph-users] ceph-disk vs. ceph-volume: both error prone

2018-02-11 Thread Willem Jan Withagen

On 09/02/2018 21:56, Alfredo Deza wrote:

On Fri, Feb 9, 2018 at 10:48 AM, Nico Schottelius
 wrote:


Dear list,

for a few days we are disecting ceph-disk and ceph-volume to find out,
what is the appropriate way of creating partitions for ceph.


ceph-volume does not create partitions for ceph



For years already I found ceph-disk (and especially ceph-deploy) very
error prone and we at ungleich are considering to rewrite both into a
ceph-block-do-what-I-want-tool.


This is not very simple, that is the reason why there are tools that
do this for you.



Only considering bluestore, I see that ceph-disk creates two partitions:

Device  StartEndSectors   Size Type
/dev/sde12048 206847 204800   100M Ceph OSD
/dev/sde2  206848 2049966046 2049759199 977.4G unknown

Does somebody know, what exactly belongs onto the xfs formatted first
disk and how is the data/wal/db device sde2 formatted?


If you must, I would encourage you to try ceph-disk out with full
verbosity and dissect all the system calls, which will answer how the
partitions are formatted



What I really would like to know is, how can we best extract this
information so that we are not depending on ceph-{disk,volume} anymore.


Initially you mentioned partitions, but you want to avoid ceph-disk
and ceph-volume wholesale? That is going to take a lot more effort.
These tools not only "prepare" devices
for Ceph consumption, they also "activate" them when a system boots,
it talks to the cluster to register the OSDs, etc... It isn't just
partitioning (for ceph-disk).


I personally find it very annoying that ceph-disk tries to be friends
with all the init tools that come with all the Linuxes. Let alone all the
udev stuff that starts working on disks once they are introduced into the
system.


And for FreeBSD I'm not suggesting using that, since it does not fit
the FreeBSD paradigm that things like this are not really
started automagically.


So if it is only about creating the ceph-infra, things are relatively easy.

The actual work on the partitions is done with ceph-osd --mkfs, and there
is little magic about it. Then some more options tell where the
BlueStore parts go if you want something other than the standard location.
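
For illustration, the bare-bones manual flow looks roughly like this (default
paths and simplified caps; just a sketch, not a replacement for the tooling):

    OSD_UUID=$(uuidgen)
    OSD_ID=$(ceph osd new ${OSD_UUID})            # register the OSD, get its id
    mkdir -p /var/lib/ceph/osd/ceph-${OSD_ID}     # data dir (partition, zfs dataset, ...)
    ceph auth get-or-create osd.${OSD_ID} \
        mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' \
        -o /var/lib/ceph/osd/ceph-${OSD_ID}/keyring
    ceph-osd -i ${OSD_ID} --mkfs --osd-uuid ${OSD_UUID}   # the actual mkfs/BlueStore setup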


Also, a large part of ceph-disk is complicated/obfuscated by the desire to
run on encrypted disks and/or multipath disk providers...
Running it with verbose on gives a bit of info, but the Python code is
convoluted and complex until you have it figured out. Then it starts to
become simpler, but never easy. ;-)


Writing a script that does what ceph-disk does? Take a look at 
src/vstart in the source. That script builds a full cluster during 
testing and is way more legible.
I did so for my FreeBSD multi-server cluster tests, and it is not 
complex at all.


Just my 2cts,
--WjW


Re: [ceph-users] formatting bytes and object counts in ceph status ouput

2018-01-03 Thread Willem Jan Withagen
On 3-1-2018 00:44, Dan Mick wrote:
> On 01/02/2018 08:54 AM, John Spray wrote:
>> On Tue, Jan 2, 2018 at 10:43 AM, Jan Fajerski  wrote:
>>> Hi lists,
>>> Currently the ceph status output formats all numbers with binary unit
>>> prefixes, i.e. 1MB equals 1048576 bytes and an object count of 1M equals
>>> 1048576 objects.  I received a bug report from a user that printing object
>>> counts with a base 2 multiplier is confusing (I agree) so I opened a bug and
>>> https://github.com/ceph/ceph/pull/19117.
>>> In the PR discussion a couple of questions arose that I'd like to get some
>>> opinions on:
>>
>>> - Should we print binary unit prefixes (MiB, GiB, ...) since that would be
>>> technically correct?
>>
>> I'm not a fan of the technically correct base 2 units -- they're still
>> relatively rarely used, and I've spent most of my life using kB to
>> mean 1024, not 1000.
>>
>>> - Should counters (like object counts) be formatted with a base 10
>>> multiplier or  a multiplier woth base 2?
>>
>> I prefer base 2 for any dimensionless quantities (or rates thereof) in
>> computing.  Metres and kilograms go in base 10, bytes go in base 2.
>>
>> It's all very subjective and a matter of opinion of course, and my
>> feelings aren't particularly strong :-)
>>
>> John
> 
> 100% agreed.  "iB" is an affectation IMO.  But I'm grumpy and old.

+1 on all 3 cases. :)

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deterministic naming of LVM volumes (ceph-volume)

2017-12-13 Thread Willem Jan Withagen
On 13-12-2017 10:36, Stefan Kooman wrote:
> Hi,
> 
> The new style "ceph-volume" LVM way of provisioning OSDs introduces a
> little challange for us. In order to create the OSDs as logical,
> consistent and easily recognizable as possible, we try to name the
> Volume Groups (VG) and Logical Volumes (LV) the same as the OSD. For
> example: OSD no. 12 will be named /dev/osd.12/osd.12. So we don't use:
> 
> "ceph-volume lvm create /dev/device" 
> 
> but use:
> 
> "ceph-volume lvm prepare --bluestore --data osd.$OSD_ID/osd.$OSD_ID"
> 
> and 
> 
> "ceph-volume lvm activate --bluestore $OSD_ID $OSD_FSID" 
> 
> However, this way of provisioning requires to know the OSD_ID before
> creating the VG/LV. Is there a way to ask Ceph which OSD_ID
> would be next up?

Stefan,

The ceph-disk code does something like this:

def allocate_osd_id(
    cluster,
    fsid,
    keyring,
    path,
):
    """
    Allocates an OSD id on the given cluster.

    :raises: Error if the call to allocate the OSD id fails.
    :return: The allocated OSD id.
    """
    lockbox_path = os.path.join(STATEDIR, 'osd-lockbox', fsid)
    lockbox_osd_id = read_one_line(lockbox_path, 'whoami')
    osd_keyring = os.path.join(path, 'keyring')
    if lockbox_osd_id:
        LOG.debug('Getting OSD id from Lockbox...')
        osd_id = lockbox_osd_id
        shutil.move(os.path.join(lockbox_path, 'osd_keyring'),
                    osd_keyring)
        path_set_context(osd_keyring)
        os.unlink(os.path.join(lockbox_path, 'whoami'))
        return osd_id

    LOG.debug('Allocating OSD id...')
    secrets = Secrets()
    try:
        wanttobe = read_one_line(path, 'wanttobe')
        if os.path.exists(os.path.join(path, 'wanttobe')):
            os.unlink(os.path.join(path, 'wanttobe'))
        id_arg = wanttobe and [wanttobe] or []
        osd_id = command_with_stdin(
            [
                'ceph',
                '--cluster', cluster,
                '--name', 'client.bootstrap-osd',
                '--keyring', keyring,
                '-i', '-',
                'osd', 'new',
                fsid,
            ] + id_arg,
            secrets.get_json()
        )
    except subprocess.CalledProcessError as e:
        raise Error('ceph osd create failed', e, e.output)
    osd_id = must_be_one_line(osd_id)
    check_osd_id(osd_id)
    secrets.write_osd_keyring(osd_keyring, osd_id)
    return osd_id
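
Stripped of all the ceph-disk plumbing, the interesting part is the 
'osd new' call: it reserves the next free id and prints it. A minimal 
sketch of using that for your naming scheme (assuming a bootstrap-osd or 
admin keyring is available; whether prepare re-uses the reserved id 
depends on your ceph-volume version, if I remember correctly the newer 
ones grew --osd-id/--osd-fsid options for exactly this):

    OSD_FSID=$(uuidgen)
    OSD_ID=$(ceph osd new "${OSD_FSID}")      # allocates and prints the next free id
    vgcreate "osd.${OSD_ID}" /dev/device
    lvcreate -l 100%FREE -n "osd.${OSD_ID}" "osd.${OSD_ID}"
    ceph-volume lvm prepare --bluestore --data "osd.${OSD_ID}/osd.${OSD_ID}"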


--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Willem Jan Withagen

On 11/12/2017 15:13, Tobias Prousa wrote:

Hi there,

I'm running a CEPH cluster for some libvirt VMs and a CephFS providing 
/home to ~20 desktop machines. There are 4 Hosts running 4 MONs, 4MGRs, 
3MDSs (1 active, 2 standby) and 28 OSDs in total. This cluster is up and 
running since the days of Bobtail (yes, including CephFS).


Might consider shutting down 1 MON, since MONs need to be in an odd 
number, and for your cluster 3 is more than sufficient.


For the reasons why, read either the Ceph docs or search this mailing list.

It probably doesn't help with your current problem, but it could help 
prevent a split-brain situation in the future.
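
Dropping one cleanly is a single command once you have picked the victim 
(the monitor name is of course yours to fill in), and afterwards the 
quorum status should show 3 monitors:

    ceph mon remove <mon-id>
    ceph quorum_status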


--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk is now deprecated

2017-11-28 Thread Willem Jan Withagen

On 28-11-2017 13:32, Alfredo Deza wrote:


I understand that this would involve a significant effort to fully
port over and drop ceph-disk entirely, and I don't think that dropping
ceph-disk in Mimic is set in stone (yet).


Alfredo,

When I expressed my concerns about deprecating ceph-disk, I was led to 
believe that I had at least two release cycles to come up with something 
of a 'ceph-volume zfs'.


Reading this, there is a possibility that it will get dropped IN mimic?
Which means that there is less than 1 release cycle to get it working?

Thanx,
--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Willem Jan Withagen
On 3-11-2017 00:09, Nigel Williams wrote:
> On 3 November 2017 at 07:45, Martin Overgaard Hansen  
> wrote:
>> I want to bring this subject back in the light and hope someone can provide
>> insight regarding the issue, thanks.

> Is it possible to make the DB partition (on the fastest device) too
> big? in other words is there a point where for a given set of OSDs
> (number + size) the DB partition is sized too large and is wasting
> resources. I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.

Wasting resources is probably relative.

SSDs have a limited lifetime, and Ceph is a seriously hard (ab)user of
the write endurance of SSDs.

Now if you over-dimension the allocated space, it looks like it is not
used. But underneath, in the SSD firmware, writing is spread out over all
cells of the SSD, so the wear is evenly distributed over all components of
the SSD.

And by overcommitting you have thus prolonged the life of your SSD.

So it is either buy more now and replace less,
or allocate strictly and replace sooner.
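
If you want to see how that works out on your own drives, smartctl shows
the relevant counters; the attribute names differ per vendor, so take
these as assumptions:

    smartctl -A /dev/sdX | egrep -i 'Wearout|Wear_Leveling|Total_LBAs_Written'
    # Intel DC drives: Media_Wearout_Indicator counts down from 100,
    # and Total_LBAs_Written can be compared against the rated TBW.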

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

2017-11-01 Thread Willem Jan Withagen

On 01/11/2017 18:04, Chris Jones wrote:

Greg,

Thanks so much for the reply!

We are not clear on why ZFS is behaving poorly under some circumstances 
on getxattr system calls, but that appears to be the case.


Since the last update we have discovered that back-to-back booting of 
the OSD yields very fast boot time, and very fast getxattr system calls.


A longer period between boots (or perhaps related to influx of new data) 
correlates to longer boot duration. This is due to slow getxattr calls 
of certain types.


We suspect this may be a caching or fragmentation issue with ZFS for 
xattrs. Use of longer filenames appear to make this worse.


As far as I understand, a lot of this data is stored in the metadata,
which is (or can be) a different set in the (L2)ARC cache.

So are you talking about an OSD restart, or a system reboot?
I don't quite understand what you mean by back-to-back...

I have little experience with ZFS on Linux,
so whether the behaviour there is different is hard for me to tell.

If you are restarting just the OSD, I can imagine that certain restart 
sequences pre-load the metadata cache. Restarts further apart can lead 
to a different working set in the ZFS caches, and then all data 
needs to be refetched instead of coming from the L2ARC.


And note that in newer ZFS versions the in-memory ARC can even be 
compressed, leading to an even higher hit rate.


For example on my development server with 32Gb memory:
ARC: 20G Total, 1905M MFU, 16G MRU, 70K Anon, 557M Header, 1709M Other
 17G Compressed, 42G Uncompressed, 2.49:1 Ratio
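
If you want to check whether it is indeed the metadata working set that
gets evicted between the longer-spaced boots, something along these lines
should show it (the path and property names are the ZFS-on-Linux ones and
are an assumption on my side):

    grep -E 'arc_meta_used|arc_meta_limit|demand_metadata' /proc/spl/kstat/zfs/arcstats
    zfs get primarycache,secondarycache,xattr <pool>/<osd-filesystem>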

--WjW


We experimented on some OSDs with swapping over to XFS as the 
filesystem, and the problem does not appear to be present on those OSDs.


The two examples below are representative of a Long Boot (longer running 
time and more data influx between osd rebooting) and a Short Boot where 
we booted the same OSD back to back.


Notice the drastic difference in time on the getxattr that yields the 
ENODATA return. Around 0.009 secs for "long boot" and "0.0002" secs when 
the same OSD is booted back to back. Long boot time is approx 40x to 50x 
longer. Multiplied by thousands of getxattr calls, this is/was our 
source of longer boot time.


We are considering a full switch to XFS, but would love to hear any ZFS 
tuning tips that might be a short term workaround.


We are using ZFS 6.5.11 prior to implementation of the ability to use 
large dnodes which would allow the use of dnodesize=auto.


#Long Boot
<0.44>[pid 3413902] 13:08:00.884238 
getxattr("/osd/9/current/20.86bs3_head/default.34597.7\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebana_1d9e1e82d623f49c994f_0_long", 
"user.cephos.lfn3", 
"default.34597.7\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-92d9df789f9aaf007c50c50bb66e70af__head_0177C86B__14__3", 
1024) = 616 <0.44>
<0.008875>[pid 3413902] 13:08:00.884476 
getxattr("/osd/9/current/20.86bs3_head/default.34597.57\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_79a7acf2d32f4302a1a4_0_long", 
"user.cephos.lfn3-alt", 0x7f849bf95180, 1024) = -1 ENODATA (No data 
available) <0.008875>


#Short Boot
<0.15> [pid 3452111] 13:37:18.604442 
getxattr("/osd/9/current/20.15c2s3_head/default.34597.22\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_efb8ca13c57689d76797_0_long", 
"user.cephos.lfn3", 
"default.34597.22\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-b519f8607a3d9de0f815d18b6905b27d__head_9726F5C2__14__3", 
1024) = 617 <0.15>
<0.18> [pid 3452111] 13:37:18.604546 

Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-10 Thread Willem Jan Withagen
On 10-10-2017 14:21, Alfredo Deza wrote:
> On Tue, Oct 10, 2017 at 8:14 AM, Willem Jan Withagen <w...@digiware.nl> wrote:
>> On 10-10-2017 13:51, Alfredo Deza wrote:
>>> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer <ch...@gol.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> (pet peeve alert)
>>>> On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote:
>>>>
>>>>> To put this in context, the goal here is to kill ceph-disk in mimic.
>>
>> Right, that means we need a ceph-volume zfs before things get shot down.
>> Fortunately there is little history to carry over.
>>
>> But then still somebody needs to do the work. ;-|
>> Haven't looked at ceph-volume, but I'll put it on the agenda.
> 
> An interesting take on zfs (and anything else we didn't set up from
> the get-go) is that we envisioned developers might
> want to craft plugins for ceph-volume and expand its capabilities,
> without placing the burden of coming up
> with new device technology to support.
> 
> The other nice aspect of this is that a plugin would get to re-use all
> the tooling in place in ceph-volume. The plugin architecture
> exists but it isn't fully developed/documented yet.

I was part of the original discussion when ceph-volume was said to be
pluggable... and I would be a great proponent of the plugins.
If only because ceph-disk is rather convoluted to add to. Not that it
cannot be done, but the code is rather loaded with Linuxisms for its
devices. And it takes some care not to upset the old code, even to the
point that the code for a routine is refactored into 3 new routines: one OS
selector, then the old code for Linux, and the new code for FreeBSD.
And that starts to look like a poor man's plugin. :)

But still I need to find the time, and sharpen my Python skills.
Luckily mimic is 9 months away. :)

--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-10 Thread Willem Jan Withagen
On 10-10-2017 13:51, Alfredo Deza wrote:
> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer  wrote:
>>
>> Hello,
>>
>> (pet peeve alert)
>> On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote:
>>
>>> To put this in context, the goal here is to kill ceph-disk in mimic.

Right, that means we need a ceph-volume zfs before things get shot down.
Fortunately there is little history to carry over.

But then still somebody needs to do the work. ;-|
Haven't looked at ceph-volume, but I'll put it on the agenda.

--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread Willem Jan Withagen
On 29-8-2017 19:12, Steve Taylor wrote:
> Hong,
> 
> Probably your best chance at recovering any data without special,
> expensive, forensic procedures is to perform a dd from /dev/sdb to
> somewhere else large enough to hold a full disk image and attempt to
> repair that. You'll want to use 'conv=noerror' with your dd command
> since your disk is failing. Then you could either re-attach the OSD
> from the new source or attempt to retrieve objects from the filestore
> on it.

Like somebody else already pointed out: in problem cases like this disk,
use dd_rescue.
It has a far better chance of producing a usable copy of your disk.
--WjW

> I have actually done this before by creating an RBD that matches the
> disk size, performing the dd, running xfs_repair, and eventually
> adding it back to the cluster as an OSD. RBDs as OSDs is certainly a
> temporary arrangement for repair only, but I'm happy to report that it
> worked flawlessly in my case. I was able to weight the OSD to 0,
> offload all of its data, then remove it for a full recovery, at which
> point I just deleted the RBD.
> 
> The possibilities afforded by Ceph inception are endless. ☺
> 
> 
>  
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 | 
>  
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
> 
>  
> 
> On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
>> Rule of thumb with batteries is:
>> - more “proper temperature” you run them at the more life you get out
>> of them
>> - more battery is overpowered for your application the longer it will
>> survive. 
>>
>> Get your self a LSI 94** controller and use it as HBA and you will be
>> fine. but get MORE DRIVES ! … 
>>> On 28 Aug 2017, at 23:10, hjcho616  wrote:
>>>
>>> Thank you Tomasz and Ronny.  I'll have to order some hdd soon and
>>> try these out.  Car battery idea is nice!  I may try that.. =)  Do
>>> they last longer?  Ones that fit the UPS original battery spec
>>> didn't last very long... part of the reason why I gave up on them..
>>> =P  My wife probably won't like the idea of car battery hanging out
>>> though ha!
>>>
>>> The OSD1 (one with mostly ok OSDs, except that smart failure)
>>> motherboard doesn't have any additional SATA connectors available.
>>>  Would it be safe to add another OSD host?
>>>
>>> Regards,
>>> Hong
>>>
>>>
>>>
>>> On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz >> mail.com> wrote:
>>>
>>>
>>> Sorry for being brutal … anyway 
>>> 1. get the battery for UPS ( a car battery will do as well, I’ve
>>> moded on ups in the past with truck battery and it was working like
>>> a charm :D )
>>> 2. get spare drives and put those in because your cluster CAN NOT
>>> get out of error due to lack of space
>>> 3. Follow advice of Ronny Aasen on hot to recover data from hard
>>> drives 
>>> 4 get cooling to drives or you will loose more ! 
>>>
>>>
 On 28 Aug 2017, at 22:39, hjcho616  wrote:

 Tomasz,

 Those machines are behind a surge protector.  Doesn't appear to
 be a good one!  I do have a UPS... but it is my fault... no
 battery.  Power was pretty reliable for a while... and UPS was
 just beeping every chance it had, disrupting some sleep.. =P  So
 running on surge protector only.  I am running this in home
 environment.   So far, HDD failures have been very rare for this
 environment. =)  It just doesn't get loaded as much!  I am not
 sure what to expect, seeing that "unfound" and just a feeling of
 possibility of maybe getting OSD back made me excited about it.
 =) Thanks for letting me know what should be the priority.  I
 just lack experience and knowledge in this. =) Please do continue
 to guide me though this. 

 Thank you for the decode of that smart messages!  I do agree that
 looks like it is on its way out.  I would like to know how to get
 good portion of it back if possible. =)

 I think I just set the size and min_size to 1.
 # ceph osd lspools
 0 data,1 metadata,2 rbd,
 # ceph osd pool set rbd size 1
 set pool 2 size to 1
 # ceph osd pool set rbd min_size 1
 set pool 2 min_size to 1

 Seems to be doing some backfilling work.

 # ceph health
 HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 2
 pgs backfill_toofull; 74 pgs backfill_wait; 3 pgs backfilling;
 108 pgs degraded; 6 pgs down; 6 pgs inconsistent; 6 pgs peering;
 7 pgs recovery_wait; 16 pgs stale; 108 pgs stuck degraded; 6 pgs
 stuck inactive; 16 pgs stuck stale; 130 pgs stuck unclean; 101
 pgs stuck undersized; 101 pgs undersized; 1 requests are blocked
> 32 sec; 

Re: [ceph-users] librados for MacOS

2017-08-03 Thread Willem Jan Withagen
On 03/08/2017 09:36, Brad Hubbard wrote:
> On Thu, Aug 3, 2017 at 5:21 PM, Martin Palma  wrote:
>> Hello,
>>
>> is there a way to get librados for MacOS? Has anybody tried to build
>> librados for MacOS? Is this even possible?
> 
> Yes, it is eminently possible, but would require a dedicated effort.
> 
> As far as I know there is no one working on this atm.

Looking at the code I've come across a few #ifdefs for OSX and the like.
So attempts have been made, but I think that code has rotted.
Now FreeBSD and MacOS have a partially shared background, so ATM I would
expect a MacOS port not to be all that complex, and it could build on some
of the stuff I've done for FreeBSD. I'm not sure if the native compiler on
the Mac is Clang, but all Clang issues are already fixed (if Clang on the
Mac is at least at 3.8).

Like Brad says: it does require persistence, and testing. But most
important, it will also require maintenance.

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)

2017-07-21 Thread Willem Jan Withagen
On 21-7-2017 12:45, Fulvio Galeazzi wrote:
> Hallo David, all,
> sorry for hi-jacking the thread but I am seeing the same issue,
> although on 10.2.7/10.2.9...

Then this is a problem that had nothing to do with my changes to
ceph-disk, since they only went into HEAD and thus end up in Luminous.
Which is fortunate, since I know nothing about systemd and all its magic.

Not to say anything about the previously reported problem.
But that also went away when ceph-disk was used differently.

--WjW

> 
> 
> Note that I am using disks taken from a SAN, so the GUIDs in my case are
> those relevant to MPATH.
> As per other messages in this thread, I modified:
>  - /usr/lib/systemd/system/ceph-osd.target
>adding to [Unit] stanza:
> Before=ceph.target
>  - /usr/lib/udev/rules.d/60-ceph-by-parttypeuuid.rules
>added at the end of this line:
> ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_TYPE}=="?*",
> ENV{ID_PART_ENTRY_UUID}=="?*",
> SYMLINK+="disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}"
> 
>the string:
> , SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}"
> 
> 
> 
> df shows (picked a problematic partition and one which mounted OK)
> .
> /dev/mapper/3600a0980005de737095a56c510cd1  3878873588  142004
> 3878731584   1% /var/lib/ceph/osd/cephba1-27
> /dev/mapper/3600a0980005ddf751e2558e2bac7p1 7779931116  202720
> 7779728396   1% /var/lib/ceph/tmp/mnt.XL7WkY
> 
> Yet, for both the GUIDs seem correct:
> 
> === /dev/mapper/3600a0980005de737095a56c510cd
> Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
> Partition unique GUID: B01E2E0D-9903-4F23-A5FD-FC1C1CB458C3
> Partition size: 7761536991 sectors (3.6 TiB)
> Partition name: 'ceph data'
> Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
> Partition unique GUID: E1B3970A-FABF-4AC0-8B6A-F7526989FF36
> Partition size: 4096 sectors (19.5 GiB)
> Partition name: 'ceph journal'
> 
> === /dev/mapper/3600a0980005ddf751e2558e2bac7
> Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
> Partition unique GUID: 93A91EBF-A531-4002-A49F-B24F27E962DD
> Partition size: 15564036063 sectors (7.2 TiB)
> Partition name: 'ceph data'
> Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
> Partition unique GUID: 2AF9B162-3398-49BD-B6EF-5D284C4A930B
> Partition size: 4096 sectors (19.5 GiB)
> Partition name: 'ceph journal'
> 
>   I rather suspect some sort of race condition, possibly causing hitting
> some timeout within systemctl... (please read the end of this message).
> I am led to think this because the OSDs which are successfully mounted
> after each reboot are a "random" subset of the configured ones (total
> ~40): also, after two or three mounts /var/lib/ceph/mnt... ceph-osd
> apparently gives up.
> 
> 
> The only workaround I found to get things going is re-running
> ceph-ansible, but it takes s long...
> 
> Have you any idea as to what is going on here? Has anybody seen (and
> solved) the same issue?
> 
>   Thanks!
> 
> Fulvio
> 
> 
> 
> 
> 
> [root@r3srv07.ba1 ~]# cat /var/lib/ceph/tmp/mnt.XL7WkY/whoami
> 143
> [root@r3srv07.ba1 ~]# umount /var/lib/ceph/tmp/mnt.XL7WkY
> [root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
> ● ceph-osd@143.service - Ceph object storage daemon
>Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
> vendor preset: disabled)
>Active: failed (Result: start-limit) since Fri 2017-07-21 11:02:23
> CEST; 1h 35min ago
>   Process: 40466 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER}
> --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
>   Process: 40217 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
> --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>  Main PID: 40466 (code=exited, status=1/FAILURE)
> 
> 
> Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service:
> main process exited, code=exited, status=1/FAILURE
> Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: Unit
> ceph-osd@143.service entered failed state.
> Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service
> failed.
> Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service
> holdoff time over, scheduling restart.
> Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: start request repeated
> too quickly for ceph-osd@143.service
> Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Failed to start Ceph
> object storage daemon.
> Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Unit
> ceph-osd@143.service entered failed state.
> Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service
> failed.
> [root@r3srv07.ba1 ~]# systemctl restart ceph-osd@143.service
> [root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
> ● ceph-osd@143.service - Ceph object storage daemon
>Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
> vendor preset: disabled)
>Active: activating (auto-restart) (Result: 

Re: [ceph-users] ceph-disk activate-block: not a block device

2017-07-20 Thread Willem Jan Withagen

Hi Roger,

Device detection has recently changed (because FreeBSD does not have 
block devices), so it could very well be that this is an actual problem 
where something is still wrong.

Please keep an eye out, and let me know if it comes back.

--WjW

Op 20-7-2017 om 19:29 schreef Roger Brown:

So I disabled ceph-disk and will chalk it up as a red herring to ignore.


On Thu, Jul 20, 2017 at 11:02 AM Roger Brown > wrote:


Also I'm just noticing osd1 is my only OSD host that even has an
enabled target for ceph-disk (ceph-disk@dev-sdb2.service).

roger@osd1:~$ systemctl list-units ceph*
  UNIT   LOAD   ACTIVE SUB DESCRIPTION
● ceph-disk@dev-sdb2.service loaded failed failed  Ceph disk
activation: /dev/sdb2
  ceph-osd@3.service loaded active running Ceph object
storage daemon osd.3
  ceph-mds.targetloaded active active  ceph target
allowing to start/stop all ceph-mds@.service instances at once
  ceph-mgr.targetloaded active active  ceph target
allowing to start/stop all ceph-mgr@.service instances at once
  ceph-mon.targetloaded active active  ceph target
allowing to start/stop all ceph-mon@.service instances at once
  ceph-osd.targetloaded active active  ceph target
allowing to start/stop all ceph-osd@.service instances at once
  ceph-radosgw.targetloaded active active  ceph target
allowing to start/stop all ceph-radosgw@.service instances at once
  ceph.targetloaded active active  ceph target
allowing to start/stop all ceph*@.service instances at once

roger@osd2:~$ systemctl list-units ceph*
UNITLOAD   ACTIVE SUB DESCRIPTION
ceph-osd@4.service  loaded active running Ceph object storage
daemon osd.4
ceph-mds.target loaded active active  ceph target allowing to
start/stop all ceph-mds@.service instances at once
ceph-mgr.target loaded active active  ceph target allowing to
start/stop all ceph-mgr@.service instances at once
ceph-mon.target loaded active active  ceph target allowing to
start/stop all ceph-mon@.service instances at once
ceph-osd.target loaded active active  ceph target allowing to
start/stop all ceph-osd@.service instances at once
ceph-radosgw.target loaded active active  ceph target allowing to
start/stop all ceph-radosgw@.service instances at once
ceph.target loaded active active  ceph target allowing to
start/stop all ceph*@.service instances at once

roger@osd3:~$ systemctl list-units ceph*
UNITLOAD   ACTIVE SUB DESCRIPTION
ceph-osd@0.service  loaded active running Ceph object storage
daemon osd.0
ceph-mds.target loaded active active  ceph target allowing to
start/stop all ceph-mds@.service instances at once
ceph-mgr.target loaded active active  ceph target allowing to
start/stop all ceph-mgr@.service instances at once
ceph-mon.target loaded active active  ceph target allowing to
start/stop all ceph-mon@.service instances at once
ceph-osd.target loaded active active  ceph target allowing to
start/stop all ceph-osd@.service instances at once
ceph-radosgw.target loaded active active  ceph target allowing to
start/stop all ceph-radosgw@.service instances at once
ceph.target loaded active active  ceph target allowing to
start/stop all ceph*@.service instances at once


On Thu, Jul 20, 2017 at 10:23 AM Roger Brown
> wrote:

I think I need help with some OSD trouble. OSD daemons on two
hosts started flapping. At length, I rebooted host osd1
(osd.3), but the OSD daemon still fails to start. Upon closer
inspection, ceph-disk@dev-sdb2.service is failing to start due
to, "Error: /dev/sdb2 is not a block device"

This is the command I see it failing to run:

roger@osd1:~$ sudo /usr/sbin/ceph-disk --verbose
activate-block /dev/sdb2
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in 
load_entry_point('ceph-disk==1.0.0', 'console_scripts',
'ceph-disk')()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py",
line 5731, in run
main(sys.argv[1:])
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py",
line 5682, in main
args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py",
line 5438, in 
func=lambda args: main_activate_space(name, args),
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py",
line 4160, in main_activate_space
osd_uuid = get_space_osd_uuid(name, dev)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py",
  

Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Willem Jan Withagen
On 26-6-2017 09:01, Christian Wuerdig wrote:
> Well, preferring faster clock CPUs for SSD scenarios has been floated
> several times over the last few months on this list. And realistic or
> not, Nick's and Kostas' setup are similar enough (testing single disk)
> that it's a distinct possibility.
> Anyway, as mentioned measuring the performance counters would probably
> provide more insight.

I read the advice as:
prefer GHz over cores.

And especially since there is a sort of balance between either GHz or
cores, that can be an expensive one. Getting both means you have to pay
relatively substantially more money.

And for an average Ceph server with plenty of OSDs, I personally just don't
buy that. There you'd have to look at the total throughput of the
system, and latency is only one of many factors.

Let alone in a cluster with several hosts (and/or racks). There the
latency is dictated by the network, so a bad choice of network card or
switch will outdo any extra cycles that your CPU can burn.

I think that just testing 1 OSD is testing artifacts, and has very
little to do with running an actual Ceph cluster.

So if one would like to test this, the test setup should be something
like: 3 hosts with something like 3 disks per host, min_size=2 and a
nice workload.
Then turn the GHz-knob and see what happens with client latency and
throughput.
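
For the workload part, something as simple as this against a small test
pool would already show the trend; pool and image names are made up, and
the fio rbd engine has to be available in your fio build:

    rados bench -p bench 60 write -b 4096 -t 16 --no-cleanup
    rados bench -p bench 60 rand -t 16
    fio --name=rr --ioengine=rbd --clientname=admin --pool=bench --rbdname=test \
        --rw=randread --bs=4k --iodepth=16 --numjobs=4 --runtime=60 --time_based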

--WjW

> On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen <w...@digiware.nl
> <mailto:w...@digiware.nl>> wrote:
> 
> 
> 
> Op 24 jun. 2017 om 14:17 heeft Maged Mokhtar <mmokh...@petasan.org
> <mailto:mmokh...@petasan.org>> het volgende geschreven:
> 
>> My understanding was this test is targeting latency more than
>> IOPS. This is probably why its was run using QD=1. It also makes
>> sense that cpu freq will be more important than cores. 
>>
> 
> But then it is not generic enough to be used as an advise!
> It is just a line in 3D-space. 
> As there are so many
> 
> --WjW
> 
>> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>>
>>> On 24-6-2017 05:30, Christian Wuerdig wrote:
>>>> The general advice floating around is that your want CPUs with high
>>>> clock speeds rather than more cores to reduce latency and
>>>> increase IOPS
>>>> for SSD setups (see also
>>>> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
>>>> <http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/>)
>>>> So
>>>> something like a E5-2667V4 might bring better results in that
>>>> situation.
>>>> Also there was some talk about disabling the processor C states
>>>> in order
>>>> to bring latency down (something like this should be easy to test:
>>>> https://stackoverflow.com/a/22482722/220986
>>>> <https://stackoverflow.com/a/22482722/220986>)
>>>
>>> I would be very careful to call this a general advice...
>>>
>>> Although the article is interesting, it is rather single sided.
>>>
>>> The only thing is shows that there is a lineair relation between
>>> clockspeed and write or read speeds???
>>> The article is rather vague on how and what is actually tested.
>>>
>>> By just running a single OSD with no replication a lot of the
>>> functionality is left out of the equation.
>>> Nobody is running just 1 osD on a box in a normal cluster host.
>>>
>>> Not using a serious SSD is another source of noise on the conclusion.
>>> More Queue depth can/will certainly have impact on concurrency.
>>>
>>> I would call this an observation, and nothing more.
>>>
>>> --WjW
>>>>
>>>> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
>>>> <reverend...@gmail.com <mailto:reverend...@gmail.com>
>>>> <mailto:reverend...@gmail.com <mailto:reverend...@gmail.com>>>
>>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> We are in the process of evaluating the performance of a testing
>>>> cluster (3 nodes) with ceph jewel. Our setup consists of:
>>>> 3 monitors (VMs)
>>>> 2 physical servers each connected with 1 JBOD running Ubuntu
>>>> Server
>>>> 16.04
>>>>
>>>> Each server has 32 threads @2.1GHz and 128GB RAM.
>>>> The disk distribution per server is:
>>&

Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Willem Jan Withagen


> Op 24 jun. 2017 om 14:17 heeft Maged Mokhtar <mmokh...@petasan.org> het 
> volgende geschreven:
> 
> My understanding was this test is targeting latency more than IOPS. This is 
> probably why its was run using QD=1. It also makes sense that cpu freq will 
> be more important than cores. 
> 

But then it is not generic enough to be used as advice!
It is just a line in 3D-space. 
As there are so many

--WjW
>> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>> 
>>> On 24-6-2017 05:30, Christian Wuerdig wrote:
>>> The general advice floating around is that your want CPUs with high
>>> clock speeds rather than more cores to reduce latency and increase IOPS
>>> for SSD setups (see also
>>> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/) So
>>> something like a E5-2667V4 might bring better results in that situation.
>>> Also there was some talk about disabling the processor C states in order
>>> to bring latency down (something like this should be easy to test:
>>> https://stackoverflow.com/a/22482722/220986)
>> 
>> I would be very careful to call this a general advice...
>> 
>> Although the article is interesting, it is rather single sided.
>> 
>> The only thing is shows that there is a lineair relation between
>> clockspeed and write or read speeds???
>> The article is rather vague on how and what is actually tested.
>> 
>> By just running a single OSD with no replication a lot of the
>> functionality is left out of the equation.
>> Nobody is running just 1 osD on a box in a normal cluster host.
>> 
>> Not using a serious SSD is another source of noise on the conclusion.
>> More Queue depth can/will certainly have impact on concurrency.
>> 
>> I would call this an observation, and nothing more.
>> 
>> --WjW
>>> 
>>> 
>>> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
>>> <reverend...@gmail.com <mailto:reverend...@gmail.com>> wrote:
>>> 
>>> Hello,
>>> 
>>> We are in the process of evaluating the performance of a testing
>>> cluster (3 nodes) with ceph jewel. Our setup consists of:
>>> 3 monitors (VMs)
>>> 2 physical servers each connected with 1 JBOD running Ubuntu Server
>>> 16.04
>>> 
>>> Each server has 32 threads @2.1GHz and 128GB RAM.
>>> The disk distribution per server is:
>>> 38 * HUS726020ALS210 (SAS rotational)
>>> 2 * HUSMH8010BSS200 (SAS SSD for journals)
>>> 2 * ST1920FM0043 (SAS SSD for data)
>>> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>>> 
>>> Since we don't currently have a 10Gbit switch, we test the performance
>>> with the cluster in a degraded state, the noout flag set and we mount
>>> rbd images on the powered on osd node. We confirmed that the network
>>> is not saturated during the tests.
>>> 
>>> We ran tests on the NVME disk and the pool created on this disk where
>>> we hoped to get the most performance without getting limited by the
>>> hardware specs since we have more disks than CPU threads.
>>> 
>>> The nvme disk was at first partitioned with one partition and the
>>> journal on the same disk. The performance on random 4K reads was
>>> topped at 50K iops. We then removed the osd and partitioned with 4
>>> data partitions and 4 journals on the same disk. The performance
>>> didn't increase significantly. Also, since we run read tests, the
>>> journals shouldn't cause performance issues.
>>> 
>>> We then ran 4 fio processes in parallel on the same rbd mounted image
>>> and the total iops reached 100K. More parallel fio processes didn't
>>> increase the measured iops.
>>> 
>>> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
>>> the crushmap just defines the different buckets/rules for the disk
>>> separation (rotational, ssd, nvme) in order to create the required
>>> pools
>>> 
>>> Is the performance of 100.000 iops for random 4K read normal for a
>>> disk that on the same benchmark runs at more than 300K iops on the
>>> same hardware or are we missing something?
>>> 
>>> Best regards,
>>> Kostas
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  
> 
>  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Willem Jan Withagen
On 24-6-2017 05:30, Christian Wuerdig wrote:
> The general advice floating around is that your want CPUs with high
> clock speeds rather than more cores to reduce latency and increase IOPS
> for SSD setups (see also
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/) So
> something like a E5-2667V4 might bring better results in that situation.
> Also there was some talk about disabling the processor C states in order
> to bring latency down (something like this should be easy to test:
> https://stackoverflow.com/a/22482722/220986)

I would be very careful to call this general advice...

Although the article is interesting, it is rather one-sided.

The only thing it shows is that there is a linear relation between
clock speed and write or read speeds???
The article is rather vague on how and what is actually tested.

By just running a single OSD with no replication, a lot of the
functionality is left out of the equation.
Nobody is running just 1 OSD on a box in a normal cluster host.

Not using a serious SSD is another source of noise in the conclusion.
More queue depth can/will certainly have an impact on concurrency.

I would call this an observation, and nothing more.

--WjW
> 
> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> > wrote:
> 
> Hello,
> 
> We are in the process of evaluating the performance of a testing
> cluster (3 nodes) with ceph jewel. Our setup consists of:
> 3 monitors (VMs)
> 2 physical servers each connected with 1 JBOD running Ubuntu Server
> 16.04
> 
> Each server has 32 threads @2.1GHz and 128GB RAM.
> The disk distribution per server is:
> 38 * HUS726020ALS210 (SAS rotational)
> 2 * HUSMH8010BSS200 (SAS SSD for journals)
> 2 * ST1920FM0043 (SAS SSD for data)
> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
> 
> Since we don't currently have a 10Gbit switch, we test the performance
> with the cluster in a degraded state, the noout flag set and we mount
> rbd images on the powered on osd node. We confirmed that the network
> is not saturated during the tests.
> 
> We ran tests on the NVME disk and the pool created on this disk where
> we hoped to get the most performance without getting limited by the
> hardware specs since we have more disks than CPU threads.
> 
> The nvme disk was at first partitioned with one partition and the
> journal on the same disk. The performance on random 4K reads was
> topped at 50K iops. We then removed the osd and partitioned with 4
> data partitions and 4 journals on the same disk. The performance
> didn't increase significantly. Also, since we run read tests, the
> journals shouldn't cause performance issues.
> 
> We then ran 4 fio processes in parallel on the same rbd mounted image
> and the total iops reached 100K. More parallel fio processes didn't
> increase the measured iops.
> 
> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
> the crushmap just defines the different buckets/rules for the disk
> separation (rotational, ssd, nvme) in order to create the required
> pools
> 
> Is the performance of 100.000 iops for random 4K read normal for a
> disk that on the same benchmark runs at more than 300K iops on the
> same hardware or are we missing something?
> 
> Best regards,
> Kostas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-22 Thread Willem Jan Withagen
On 22-6-2017 03:59, Christian Balzer wrote:
>> Agreed. On the topic of journals and double bandwidth, am I correct in
>> thinking that btrfs (as insane as it may be) does not require double
>> bandwidth like xfs? Furthermore with bluestore being close to stable, will
>> my architecture need to change?
>>
> BTRFS at this point is indeed a bit insane, given the current levels of
> support, issues (search the ML archives) and future developments. 
> And you'll still wind up with double writes most likely, IIRC.
> 
> These aspects of Bluestore have been discussed here recently, too.
> Your SSD/NVMe space requirements will go down, but if you want to have the
> same speeds and more importantly low latencies you'll wind up with all
> writes going through them again, so endurance wise you're still in that
> "Lets make SSDs great again" hellhole. 

Please note that I know little about btrfs, but its sister ZFS can
include caching/log devices transparently in its architecture. And even
better, they are allowed to fail without much of a problem. :)

Now the problem I have is that Ceph first journals the writes to its own
log, then hands the write over to ZFS, where it gets logged again.
So that is 2 writes (and in the case of ZFS, the log only gets read if
the filesystem had a crash).

The thing about ZFS is that the journal log need not be very big: about
5 seconds of the maximum required disk writes. I have mine at 1 GB and they
have never filled up yet. But the bandwidth used is going to double, due to
double the amount of writes.

If the logging of btrfs is anything like this, then you have to look at how
you architect the filesystems/devices underlying Ceph.
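
For the ZFS side that looks roughly like this; pool and device names are
made up, and the mirrored SLOG is optional:

    zpool add osdpool log mirror gpt/slog0 gpt/slog1   # ~1 GB is already generous
    zpool add osdpool cache gpt/l2arc0
    zpool iostat -v osdpool 5                          # shows how little the SLOG actually holds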

--WjW
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EXT: ceph-lvm - a tool to deploy OSDs from LVM volumes

2017-06-19 Thread Willem Jan Withagen



Op 19-6-2017 om 19:57 schreef Alfredo Deza:

On Mon, Jun 19, 2017 at 11:37 AM, Willem Jan Withagen <w...@digiware.nl> wrote:

On 19-6-2017 16:13, Alfredo Deza wrote:

On Mon, Jun 19, 2017 at 9:27 AM, John Spray <jsp...@redhat.com> wrote:

On Fri, Jun 16, 2017 at 7:23 PM, Alfredo Deza <ad...@redhat.com> wrote:

On Fri, Jun 16, 2017 at 2:11 PM, Warren Wang - ISD
<warren.w...@walmart.com> wrote:


I would just try to glue it into ceph-disk in the most flexible way

We can't "glue it into ceph-disk" because we are proposing a
completely new way of doing things that
go against how ceph-disk works.


'mmm,

Not really a valid argument if you want the two to become equal.

I have limited Python knowledge, but I can envision an outer wrapper 
that just calls the old version of ceph-disk as an external executable. 
User impact is thus reduced to a bare minimum.

Got to admit that it is not very elegant, but it would work.

But I'll see what you guys come up with.
The best proof is always the code.

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EXT: ceph-lvm - a tool to deploy OSDs from LVM volumes

2017-06-19 Thread Willem Jan Withagen
On 19-6-2017 16:13, Alfredo Deza wrote:
> On Mon, Jun 19, 2017 at 9:27 AM, John Spray  wrote:
>> On Fri, Jun 16, 2017 at 7:23 PM, Alfredo Deza  wrote:
>>> On Fri, Jun 16, 2017 at 2:11 PM, Warren Wang - ISD
>>>  wrote:
 I would prefer that this is something more generic, to possibly support 
 other backends one day, like ceph-volume. Creating one tool per backend 
 seems silly.

 Also, ceph-lvm seems to imply that ceph itself has something to do with 
 lvm, which it really doesn’t. This is simply to deal with the underlying 
 disk. If there’s resistance to something more generic like ceph-volume, 
 then it should at least be called something like ceph-disk-lvm.
>>>
>>> Sage, you had mentioned the need for "composable" tools for this, and
>>> I think that if we go with `ceph-volume` we could allow plugins for
>>> each strategy. We are starting with `lvm` support so that would look
>>> like: `ceph-volume lvm`
>>>
>>> The `lvm` functionality could be implemented as a plugin itself, and
>>> when we start working with supporting regular disks, then `ceph-volume
>>> disk` can come along, etc...
>>>
>>> It would also open the door for anyone to be able to write a plugin to
>>> `ceph-volume` to implement their own logic, while at the same time
>>> re-using most of what we are implementing today: logging, reporting,
>>> systemd support, OSD metadata, etc...
>>>
>>> If we were to separate these into single-purpose tools, all those
>>> would need to be re-done.
>>
>> Couple of thoughts:
>>  - let's keep this in the Ceph repository unless there's a strong
>> reason not to -- it'll enable the tool's branching to automatically
>> happen in line with Ceph's.
> 
> For initial development this is easier to have as a separate tool from
> the Ceph source tree. There are some niceties about being in-source,
> like
> not being required to deal with what features we are supporting on what 
> version.

Just my observation, need not be true at all, but ...

As long as you do not have it interact with the other tools, that is
true. But as soon as you start depending on ceph-{disk-new,volume} in
other parts of the mainstream Ceph code, you have created a tie-in with
the versioning and it will have to be maintained in the same way.


> Although there is no code yet, I consider the project in an "unstable"
> state, it will move incredibly fast (it has to!) and that puts it at
> odds with the cadence
> of Ceph. Specifically, these two things are very important right now:
> 
> * faster release cycles
> * easier and faster to test
> 
> I am not ruling out going into Ceph at some point though, ideally when
> things slow down and become stable.
> 
> Is your argument only to have parity in Ceph's branching? That was
> never a problem with out-of-tree tools like ceph-deploy for example.

Some of the external targets move so fast (ceph-ansible) that I have
given up on trying to see what is going on. For this tool I'd like to
do the ZFS/FreeBSD stuff as a plugin module, in the expectation that it
will supersede the current ceph-disk; otherwise there are 2 places to
maintain this type of code.

>>  - I agree with others that a single entrypoint (i.e. executable) will
>> be more manageable than having conspicuously separate tools, but we
>> shouldn't worry too much about making things "plugins" as such -- they
>> can just be distinct code inside one tool, sharing as much or as
>> little as they need.
>>
>> What if we delivered this set of LVM functionality as "ceph-disk lvm
>> ..." commands to minimise the impression that the tooling is changing,
>> even if internally it's all new/distinct code?
> 
> That sounded appealing initially, but because we are introducing a
> very different API, it would look odd to interact
> with other subcommands without a normalized interaction. For example,
> for 'prepare' this would be:
> 
> ceph-disk prepare [...]
> 
> And for LVM it would possible be
> 
> ceph-disk lvm prepare [...]
> 
> The level at which these similar actions are presented imply that one
> may be a preferred (or even default) one, while the other one
> isn't.

Is this about API "cosmetics"? Because there are a lot of examples,
suggestions and other material out there that use the old syntax.

And why not do a hybrid? It will require a bit more command-line parsing,
but that is not a major dealbreaker.

So the line would look like:
ceph-disk [lvm,zfs,disk,partition] prepare [...]
and the first parameter is optional, falling back to the currently
supported behaviour.

You can always start warning users that their API usage is old style,
and that it is going to go away in a future release.

> At one point we are going to add regular disk worfklows (replacing
> ceph-disk functionality) and then it would become even more
> confusing to keep it there (or do you think at that point we could split?)

The more separate you go, the more awkward it is going to be when 

Re: [ceph-users] EXT: ceph-lvm - a tool to deploy OSDs from LVM volumes

2017-06-16 Thread Willem Jan Withagen
On 16-6-2017 20:23, Alfredo Deza wrote:
> On Fri, Jun 16, 2017 at 2:11 PM, Warren Wang - ISD
>  wrote:
>> I would prefer that this is something more generic, to possibly support 
>> other backends one day, like ceph-volume. Creating one tool per backend 
>> seems silly.
>>
>> Also, ceph-lvm seems to imply that ceph itself has something to do with lvm, 
>> which it really doesn’t. This is simply to deal with the underlying disk. If 
>> there’s resistance to something more generic like ceph-volume, then it 
>> should at least be called something like ceph-disk-lvm.
> 
> Sage, you had mentioned the need for "composable" tools for this, and
> I think that if we go with `ceph-volume` we could allow plugins for
> each strategy. We are starting with `lvm` support so that would look
> like: `ceph-volume lvm`
> 
> The `lvm` functionality could be implemented as a plugin itself, and
> when we start working with supporting regular disks, then `ceph-volume
> disk` can come along, etc...
> 
> It would also open the door for anyone to be able to write a plugin to
> `ceph-volume` to implement their own logic, while at the same time
> re-using most of what we are implementing today: logging, reporting,
> systemd support, OSD metadata, etc...
> 
> If we were to separate these into single-purpose tools, all those
> would need to be re-done.

Looking at the work I did on ceph-disk for FreeBSD, it starts to
look like something like `ceph-volume zfs`. It would make porting a lot
more manageable. But the composable decomposition needs to be at a level
high enough that as few Linuxisms as possible seep through.
The one that springs to mind here is: encryption.

So +1

--WjW
> 
> 
>>
>> 2 cents from one of the LVM for Ceph users,
>> Warren Wang
>> Walmart ✻
>>
>> On 6/16/17, 10:25 AM, "ceph-users on behalf of Alfredo Deza" 
>>  wrote:
>>
>> Hello,
>>
>> At the last CDM [0] we talked about `ceph-lvm` and the ability to
>> deploy OSDs from logical volumes. We have now an initial draft for the
>> documentation [1] and would like some feedback.
>>
>> The important features for this new tool are:
>>
>> * parting ways with udev (new approach will rely on LVM functionality
>> for discovery)
>> * compatibility/migration for existing LVM volumes deployed as 
>> directories
>> * dmcache support
>>
>> By documenting the API and workflows first we are making sure that
>> those look fine before starting on actual development.
>>
>> It would be great to get some feedback, specially if you are currently
>> using LVM with ceph (or planning to!).
>>
>> Please note that the documentation is not complete and is missing
>> content on some parts.
>>
>> [0] http://tracker.ceph.com/projects/ceph/wiki/CDM_06-JUN-2017
>> [1] http://docs.ceph.com/ceph-lvm/
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-03 Thread Willem Jan Withagen
On 02-05-17 23:53, David Turner wrote:
> I was only interjecting on the comment "So that is 5 . Which is real
> easy to obtain" and commenting on what the sustained writes into a
> cluster of 2,000 OSDs would require to actually sustain that 5 MBps on
> each SSD journal.

Reading your calculation below I understand where the 2000 comes from.
I meant that hardware of the previous millennium could easily write 5
Mbyte/sec sustained. :)

This does NOT otherwise invalidate your interesting math below.
And your conclusion is an important one.

Go back to 200 OSDs and 25 SSDs and you end up with 40MB/s sustained
writes to wear out your SSDs exactly at the end of the warranty.
Higher sustained writes will linearly shorten your SSD lifetime.

--WjW

> My calculation was off because I forgot replica size, but my corrected
> math is this...
> 
> 5 MBps per journal device
> 8 OSDs per journal (overestimated number as most do 4)
> 2,000 OSDs based on what you said "Which is real easy to obtain, even
> with hardware 0f 2000."
> 3 replicas
> 
> 2,000 OSDs / 8 OSDs per journal = 250 journal SSDs
> 250 SSDs * 5 MBps = 1,250 MBps / 3 replicas = 416.67 MBps required
> sustained cluster write speed to cause each SSD to average 5 MBps on
> each journal device.
> 
> Swap out any variable you want to match your environment.  For example,
> if you only have 4 OSDs per journal device, that number would be double
> for a cluster this size to require a cluster write speed of
> 833.33 MBps to average 5 MBps on each journal.  Also if you have less
> than 2,000 OSDs, then everything shrinks fast.
> 
> 
> On Tue, May 2, 2017 at 5:39 PM Willem Jan Withagen <w...@digiware.nl
> <mailto:w...@digiware.nl>> wrote:
> 
> On 02-05-17 19:54, David Turner wrote:
> > Are you guys talking about 5Mbytes/sec to each journal device? 
> Even if
> > you had 8 OSDs per journal and had 2000 osds... you would need a
> > sustained 1.25 Gbytes/sec to average 5Mbytes/sec per journal device.
> 
> I'm not sure I'm following this...
> But I'm rather curious.
> Are you saying that the required journal bandwidth versus OSD write
> bandwidth has an approx 1:200 ratio??
> 
> Note that I took it the other way.
> Given the Intel specs
>  - What sustained bandwidth is allowed to have the device last its
> lifetime.
>  - How much more usage would a 3710 give in regards to a 3520 SSD per
>dollar spent.
> 
> --WjW
> 
> > On Tue, May 2, 2017 at 1:47 PM Willem Jan Withagen
> <w...@digiware.nl <mailto:w...@digiware.nl>
> > <mailto:w...@digiware.nl <mailto:w...@digiware.nl>>> wrote:
> >
> > On 02-05-17 19:16, Дробышевский, Владимир wrote:
> > > Willem,
> > >
> > >   please note that you use 1.6TB Intel S3520 endurance
> rating in your
> > > calculations but then compare prices with 480GB model, which
> has only
> > > 945TBW or 1.1DWPD (
> > >
> >   
>  
> https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
> > > ). It also worth to notice that S3710 has tremendously
> higher write
> > > speed\IOPS and especially SYNC writes. Haven't seen S3520
> real sync
> > > write tests yet but don't think they differ much from S3510
> ones.
> >
> > Arrgh, you are right. I guess I had too many pages open, and
> copied the
> > wrong one.
> >
> > But the good news is that the stats were already in favour of
> the 3710
> > so this only increases that conclusion.
> >
> > The bad news is that the sustained write speed goes down with a
> > factor 4.
> > So that is 5Mbyte/sec. Which is real easy to obtain, even with
> hardware
> > 0f 2000.
> >
> > --WjW
> >
> >
> > > Best regards,
> > > Vladimir
> > >
> > > 2017-05-02 21:05 GMT+05:00 Willem Jan Withagen
> <w...@digiware.nl <mailto:w...@digiware.nl>
> > <mailto:w...@digiware.nl <mailto:w...@digiware.nl>>
> > > <mailto:w...@digiware.nl <mailto:w...@digiware.nl>
> <mailto:w...@digiware.nl <mailto:w...@digiware.nl>>>>:
> > >
> > > On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > > > Hi,
> > > >
> > > >>>

Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Willem Jan Withagen
On 02-05-17 19:54, David Turner wrote:
> Are you guys talking about 5Mbytes/sec to each journal device?  Even if
> you had 8 OSDs per journal and had 2000 osds... you would need a
> sustained 1.25 Gbytes/sec to average 5Mbytes/sec per journal device.

I'm not sure I'm following this...
But I'm rather curious.
Are you saying that the required journal bandwidth versus OSD write
bandwidth has an approx 1:200 ratio??

Note that I took it the other way.
Given the Intel specs:
 - What sustained bandwidth still lets the device last its rated lifetime?
 - How much more endurance does a 3710 give, per dollar spent, compared to
   a 3520 SSD?

--WjW

> On Tue, May 2, 2017 at 1:47 PM Willem Jan Withagen <w...@digiware.nl
> <mailto:w...@digiware.nl>> wrote:
> 
> On 02-05-17 19:16, Дробышевский, Владимир wrote:
> > Willem,
> >
> >   please note that you use 1.6TB Intel S3520 endurance rating in your
> > calculations but then compare prices with 480GB model, which has only
> > 945TBW or 1.1DWPD (
> >
> 
> https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
> > ). It also worth to notice that S3710 has tremendously higher write
> > speed\IOPS and especially SYNC writes. Haven't seen S3520 real sync
> > write tests yet but don't think they differ much from S3510 ones.
> 
> Arrgh, you are right. I guess I had too many pages open, and copied the
> wrong one.
> 
> But the good news is that the stats were already in favour of the 3710
> so this only increases that conclusion.
> 
> The bad news is that the sustained write speed goes down with a
> factor 4.
> So that is 5Mbyte/sec. Which is real easy to obtain, even with hardware
>     of 2000.
> 
> --WjW
> 
> 
> > Best regards,
> > Vladimir
> >
> > 2017-05-02 21:05 GMT+05:00 Willem Jan Withagen <w...@digiware.nl
> <mailto:w...@digiware.nl>
> > <mailto:w...@digiware.nl <mailto:w...@digiware.nl>>>:
> >
> > On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > > Hi,
> > >
> > >>> What I'm trying to get from the list is /why/ the
> "enterprise" drives
> > >>> are important. Performance? Reliability? Something else?
> > >
> > > performance, for sure (for SYNC write,
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >   
>  
> <https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/>)
> > >
> > > Reliabity : yes, enteprise drive have supercapacitor in case
> of powerfailure, and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> > >
> > >
> > >>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC
> S3610. Obviously
> > >>> the single drive leaves more bays free for OSD disks, but
> is there any
> > >>> other reason a single S3610 is preferable to 4 S3520s?
> Wouldn't 4xS3520s
> > >>> mean:
> > >
> > > where do you see this price difference ?
> > >
> > > for me , S3520 are around 25-30% cheaper than S3610
> >
> > I just checked for the DCS3520 on
> >   
>  
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> >   
>  
> <https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC>
> >
> > And is has a TBW of 2925 (Terrabytes Write over life time) =
> 2,9 PB
> > the warranty is 5 years.
> >
> > Now if I do the math:
> >   2925 * 1024 /5 /365 /24 /60 = 1.14 Gbyte/min to be written.
> >   which is approx 20Mbyte /sec
> >   or approx 10Gbit/min = 0,15 Gbit/sec
> >
> > And that is only 20% of the capacity of that SATA link.
> > Also writing 20Mbyte/sec sustained is not really that hard for
> modern
> > systems.
> >
> > Now a 400Gb 3710 takes 8.3 PB, which is ruffly 3 times as much.
> > so it will last 3 times longer.
> >
> > Checking Amazone, I get
> > $520 for the DC S3710-400G
> > $300 for the DC S3520-480G
> >
> > So that is less than a factor of 2 for using the S37

Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Willem Jan Withagen
On 02-05-17 19:16, Дробышевский, Владимир wrote:
> Willem,
> 
>   please note that you use 1.6TB Intel S3520 endurance rating in your
> calculations but then compare prices with 480GB model, which has only
> 945TBW or 1.1DWPD (
> https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
> ). It also worth to notice that S3710 has tremendously higher write
> speed\IOPS and especially SYNC writes. Haven't seen S3520 real sync
> write tests yet but don't think they differ much from S3510 ones.

Arrgh, you are right. I guess I had too many pages open, and copied the
wrong one.

But the good news is that the stats were already in favour of the 3710
so this only increases that conclusion.

The bad news is that the sustained write speed goes down with a factor 4.
So that is 5Mbyte/sec. Which is real easy to obtain, even with hardware
of 2000.

--WjW


> Best regards,
> Vladimir
> 
> 2017-05-02 21:05 GMT+05:00 Willem Jan Withagen <w...@digiware.nl
> <mailto:w...@digiware.nl>>:
> 
> On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > Hi,
> >
> >>> What I'm trying to get from the list is /why/ the "enterprise" drives
> >>> are important. Performance? Reliability? Something else?
> >
> > performance, for sure (for SYNC write, 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> <https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/>)
> >
> > Reliabity : yes, enteprise drive have supercapacitor in case of 
> powerfailure, and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> >
> >
> >>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. 
> Obviously
> >>> the single drive leaves more bays free for OSD disks, but is there any
> >>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 
> 4xS3520s
> >>> mean:
> >
> > where do you see this price difference ?
> >
> > for me , S3520 are around 25-30% cheaper than S3610
> 
> I just checked for the DCS3520 on
> 
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> 
> <https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC>
> 
> And it has a TBW of 2925 (Terabytes Written over its lifetime) = 2.9 PB
> the warranty is 5 years.
> 
> Now if I do the math:
>   2925 * 1024 /5 /365 /24 /60 = 1.14 Gbyte/min to be written.
>   which is approx 20Mbyte /sec
>   or approx 10Gbit/min = 0,15 Gbit/sec
> 
> And that is only 20% of the capacity of that SATA link.
> Also writing 20Mbyte/sec sustained is not really that hard for modern
> systems.
> 
> Now a 400GB 3710 is rated for 8.3 PB, which is roughly 3 times as much,
> so it will last 3 times longer.
> 
> Checking Amazon, I get
> $520 for the DC S3710-400G
> $300 for the DC S3520-480G
> 
> So that is less than a factor of 2 for using the S3710's and a 3 times
> longer lifetime. To be exact (8.3/520) / (2,9/300) = 1.65 more bang for
> your buck.
> 
> But still do not expect your SSDs to last very long if the write rate is
> much over that 20Mbyte/sec
> 
> --WjW
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 
> 
> 
> -- 
> 
> С уважением,
> Дробышевский Владимир
> Компания "АйТи Город"
> +7 343 192
> 
> ИТ-консалтинг
> Поставка проектов "под ключ"
> Аутсорсинг ИТ-услуг
> Аутсорсинг ИТ-инфраструктуры

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Willem Jan Withagen
On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> Hi,
> 
>>> What I'm trying to get from the list is /why/ the "enterprise" drives 
>>> are important. Performance? Reliability? Something else? 
> 
> performance, for sure (for SYNC write, 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/)
> 
> Reliabity : yes, enteprise drive have supercapacitor in case of powerfailure, 
> and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> 
> 
>>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>>> the single drive leaves more bays free for OSD disks, but is there any
>>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>>> mean:
> 
> where do you see this price difference ?
> 
> for me , S3520 are around 25-30% cheaper than S3610

I just checked for the DCS3520 on
https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC

And it has a TBW of 2925 (Terabytes Written over its lifetime) = 2.9 PB;
the warranty is 5 years.

Now if I do the math:
  2925 * 1024 / 5 / 365 / 24 / 60 = 1.14 GByte/min to be written,
  which is approx 20 MByte/sec,
  or approx 10 Gbit/min = 0.15 Gbit/sec

And that is only 20% of the capacity of that SATA link.
Also writing 20Mbyte/sec sustained is not really that hard for modern
systems.

Now a 400GB 3710 is rated for 8.3 PB, which is roughly 3 times as much,
so it will last 3 times longer.

Checking Amazon, I get
$520 for the DC S3710-400G
$300 for the DC S3520-480G

So that is less than a factor of 2 for using the S3710's and a 3 times
longer lifetime. To be exact, (8.3/520) / (2.9/300) = 1.65 more bang for
your buck.

But still, do not expect your SSDs to last very long if the write rate is
much over that 20 MByte/sec.
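
To make that easier to re-check (or to redo for other drives), a small
sketch of the same arithmetic in Python; the TBW and price figures are the
ones quoted above and will of course differ per capacity and vendor:

def sustained_mb_per_s(tbw_tb, warranty_years):
    # average write rate that uses up the rated TBW exactly at end of warranty
    seconds = warranty_years * 365 * 24 * 3600
    return tbw_tb * 1024 * 1024 / seconds

print(sustained_mb_per_s(2925, 5))   # DC S3520 1.6TB: ~19.5 MB/s
print(sustained_mb_per_s(8300, 5))   # DC S3710 400GB: ~55 MB/s

# endurance per dollar (PBW / $), prices as above
print((8.3 / 520) / (2.9 / 300))     # ~1.65x more bang for your buck with the S3710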

--WjW



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is cls_log_add logging so much?

2017-04-29 Thread Willem Jan Withagen
On 29-04-17 00:16, Gregory Farnum wrote:
> On Tue, Apr 4, 2017 at 2:49 AM, Jens Rosenboom  wrote:
>> On a busy cluster, I'm seeing a couple of OSDs logging millions of
>> lines like this:
>>
>> 2017-04-04 06:35:18.240136 7f40ff873700  0 
>> cls/log/cls_log.cc:129: storing entry at
>> 1_1491287718.237118_57657708.1
>> 2017-04-04 06:35:18.244453 7f4102078700  0 
>> cls/log/cls_log.cc:129: storing entry at
>> 1_1491287718.241622_57657709.1
>> 2017-04-04 06:35:18.296092 7f40ff873700  0 
>> cls/log/cls_log.cc:129: storing entry at
>> 1_1491287718.296308_57657710.1
>>
>> 1. Can someone explain what these messages mean? It seems strange to
>> me that only a few OSD generate these.
>>
>> 2. Why are they being generated at debug level 0, meaning that they
>> cannot be filtered? This should happen for a non-error message that
>> can be generated at least 50 times per second.
> 
> It looks like these are generated by one of object classes which RGW
> uses (for its geo-replication features?). They are indeed generated at
> level 0 and I can't imagine why either, unless it was just a developer
> debug message that didn't get cleaned up.
> I'm sure a patch changing it would be welcome.

There was a similar log entry that was already at level 20,
so I upped this one as well.

https://github.com/ceph/ceph/pull/14879

--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-04-04 Thread Willem Jan Withagen
On 4-4-2017 21:05, Gregory Farnum wrote:
> [ Sorry for the empty email there. :o ]
> 
> On Tue, Apr 4, 2017 at 12:28 PM, Patrick Donnelly <pdonn...@redhat.com> wrote:
>> On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen <w...@digiware.nl> wrote:
>>> On 1-4-2017 21:59, Wido den Hollander wrote:
>>>>
>>>>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen <w...@digiware.nl>:
>>>>>
>>>>>
>>>>> On 31-3-2017 17:32, Wido den Hollander wrote:
>>>>>> Hi Willem Jan,
>>>>>>
>>>>>>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>>>>>>> <w...@digiware.nl>:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm pleased to announce that my efforts to port to FreeBSD have
>>>>>>> resulted in a ceph-devel port commit in the ports tree.
>>>>>>>
>>>>>>> https://www.freshports.org/net/ceph-devel/
>>>>>>>
>>>>>>
>>>>>> Awesome work! I don't touch FreeBSD that much, but I can imagine that
>>>>>> people want this.
>>>>>>
>>>>>> Out of curiosity, does this run on ZFS under FreeBSD? Or what
>>>>>> Filesystem would you use behind FileStore with this? Or does
>>>>>> BlueStore work?
>>>>>
>>>>> Since I'm a huge ZFS fan, that is what I run it on.
>>>>
>>>> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting!
>>>
>>> Right, ZIL is magic, and more or equal to the journal now used with OSDs
>>> for exactly the same reason. Sad thing is that a write is now 3*
>>> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used
>>> bandwidth to the SSDs is double of what it could be.
>>>
>>> Had some discussion about this, but disabling the Ceph journal is not
>>> just setting an option. Although I would like to test performance of an
>>> OSD with just the ZFS journal. But I expect that the OSD journal is
>>> rather firmly integrated.
>>
>> Disabling the OSD journal will never be viable. The journal is also
>> necessary for transactions and batch updates which cannot be done
>> atomically in FileStore.
> 
> To expand on Patrick's statement: You shouldn't get confused by the
> presence of options to disable journaling. They exist but only work on
> btrfs-backed FileStores and are *not* performant. You could do the
> same on zfs, but in order to provide the guarantees of the RADOS
> protocol, when in that mode the OSD just holds replies on all
> operations until it knows they've been persisted to disk and
> snapshotted, then sends back a commit. You can probably imagine the
> horrible IO patterns and bursty application throughput that result.

When I talked about this with Sage at CERN, I got the same answer. So
this is at least consistent. ;-)

And I have to admit that I do not understand the intricate details of
this part of Ceph, so at the moment I'm looking at it from a more global
view.

What, I guess, needs to be done is to get rid of at least one of the
SSD writes.
That is possible by mounting the journal disk as a separate vdev (2
SSDs in a mirror) and getting the maximum speed out of that.
The problem with all of this is that the number of SSDs sort of blows up,
and very likely there is a lot of waste because the journals need not be
very large.

And yes, the other way would be to do BlueStore on a ZVOL, where the
underlying vdevs are carefully crafted. But first we need to get AIO
working, and I have not (yet) looked at that at all...

The first objective was to get a port of any sort, which I did last week.
The second is to take Luminous and make a "stable" port which is less of a
moving target.
Only then is AIO on the radar.

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-04-01 Thread Willem Jan Withagen
On 1-4-2017 21:59, Wido den Hollander wrote:
> 
>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen <w...@digiware.nl>:
>>
>>
>> On 31-3-2017 17:32, Wido den Hollander wrote:
>>> Hi Willem Jan,
>>>
>>>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>>>> <w...@digiware.nl>:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I'm pleased to announce that my efforts to port to FreeBSD have
>>>> resulted in a ceph-devel port commit in the ports tree.
>>>>
>>>> https://www.freshports.org/net/ceph-devel/
>>>>
>>>
>>> Awesome work! I don't touch FreeBSD that much, but I can imagine that
>>> people want this.
>>>
>>> Out of curiosity, does this run on ZFS under FreeBSD? Or what
>>> Filesystem would you use behind FileStore with this? Or does
>>> BlueStore work?
>>
>> Since I'm a huge ZFS fan, that is what I run it on.
> 
> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting!

Right, ZIL is magic, and more or less equal to the journal now used with
OSDs, for exactly the same reason. The sad thing is that a write is now
journaled 3 times: once by Ceph, and twice by ZFS. Which means that the
bandwidth used on the SSDs is double what it could be.

I had some discussion about this, but disabling the Ceph journal is not
just setting an option. Although I would like to test the performance of an
OSD with just the ZFS journal, I expect that the OSD journal is
rather firmly integrated.

Now the really nice thing is that one does not need to worry about
caching for OSD performance. That is fully covered by ZFS, both by ARC
and L2ARC. And ZIL and L2ARC can again be constructed in all the shapes and
forms that ZFS vdevs can be made in.
So for the ZIL you'd build an SSD mirror: double the write speed, but
still redundant. For the L2ARC I'd concatenate 2 SSDs to get the read
bandwidth. And contrary to some of the other caches, ZFS does not return
errors if the L2ARC devices go down (note that data errors are detected
by checksumming). So that again is one less thing to worry about.

> CRC and Compression from ZFS are also very nice.

I did not want to go into too much detail, but this is a large part of
the reason. Compression I tried a bit, but it does cost quite a bit of
performance at the Ceph end. Perhaps because the write to the journal is
synced, and thus has to wait on both compression and the synced writing.

It also brings snapshots without much hassle. But I have not yet figured
out (or looked at) if and how btrfs snapshots are used.

Another challenge is the Ceph deep scrubbing: checking for corruption
within files. ZFS is able to detect corruption all by itself due to
extensive file checksumming, and with something much stronger/better
than crc32 (I just put on my fireproof suit).
So I'm not certain that deep-scrub would be obsolete, but I think the
frequency could perhaps go down, and/or it could be triggered by ZFS
errors after scrubbing a pool. Something that has much less impact
on performance.

In some of the talks I give, I always try to explain to people that RAID
and RAID controllers are the current dinosaurs of IT.

>> To be honest I have not tested on UFS, but I would expect that the xattr
>> are not long enough.
>>
>> BlueStore is not (yet) available because there is a different AIO
>> implementation on FreeBSD. But Sage thinks it is very doable to glue in
>> posix AIO. And one of my port reviewers has offered to look at it. So it
>> could be that BlueStore will be available in the foreseeable future.
>>
>> --WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-03-31 Thread Willem Jan Withagen
On 31-3-2017 17:32, Wido den Hollander wrote:
> Hi Willem Jan,
> 
>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>> <w...@digiware.nl>:
>> 
>> 
>> Hi,
>> 
>> I'm pleased to announce that my efforts to port to FreeBSD have
>> resulted in a ceph-devel port commit in the ports tree.
>> 
>> https://www.freshports.org/net/ceph-devel/
>> 
> 
> Awesome work! I don't touch FreeBSD that much, but I can imagine that
> people want this.
> 
> Out of curiosity, does this run on ZFS under FreeBSD? Or what
> Filesystem would you use behind FileStore with this? Or does
> BlueStore work?

Since I'm a huge ZFS fan, that is what I run it on.
To be honest I have not tested on UFS, but I would expect that the xattrs
are not long enough.

BlueStore is not (yet) available because there is a different AIO
implementation on FreeBSD. But Sage thinks it is very doable to glue in
posix AIO. And one of my port reviewers has offered to look at it. So it
could be that BlueStore will be available in the foreseeable future.

--WjW
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FreeBSD port net/ceph-devel released

2017-03-30 Thread Willem Jan Withagen
Hi,

I'm pleased to announce that my efforts to port to FreeBSD have resulted
in a ceph-devel port commit in the ports tree.

https://www.freshports.org/net/ceph-devel/

I'd like to thank everybody that helped me by answering my questions,
fixing my mistakes, and undoing my Git mess. Especially Sage, Kefu and
Haomei gave a lot of support.

The next step will be to release a net/ceph port when the
'Luminous' version officially goes into release.

In the meantime I'll be updating the ceph-devel port to a more current
state of affairs

Thanx,
--WjW
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anyone using LVM or ZFS RAID1 for boot drives?

2017-02-13 Thread Willem Jan Withagen
On 13-2-2017 04:22, Alex Gorbachev wrote:
> Hello, with the preference for IT mode HBAs for OSDs and journals,
> what redundancy method do you guys use for the boot drives.  Some
> options beyond RAID1 at hardware level we can think of:
> 
> - LVM
> 
> - ZFS RAID1 mode

Since it is not quite Ceph, I take the liberty of answering with a bit of
non-Linux. :)

On FreeBSD I always use RAID1 boot disks; it is natively supported by
both the kernel and the installer. It fits really nicely with the upgrade
tools, allowing a roll-back if an upgrade did not work, or booting from one
of the previously snapshotted boot disks.

NAME SIZE  ALLOC   FREE  EXPANDSZ   FRAGCAP  DEDUP  HEALTH
zfsroot  228G  2.57G   225G - 0% 1%  1.00x  ONLINE
  mirror 228G  2.57G   225G - 0% 1%
ada0p3  -  -  - -  -  -
ada1p3  -  -  - -  -  -

zfsroot   2.57G   218G19K  /zfsroot
zfsroot/ROOT  1.97G   218G19K  none
zfsroot/ROOT/default  1.97G   218G  1.97G  /
zfsroot/tmp   22.5K   218G  22.5K  /tmp
zfsroot/usr613M   218G19K  /usr
zfsroot/usr/compat  19K   218G19K  /usr/compat
zfsroot/usr/home34K   218G34K  /usr/home
zfsroot/usr/local  613M   218G   613M  /usr/local
zfsroot/usr/ports   19K   218G19K  /usr/ports
zfsroot/usr/src 19K   218G19K  /usr/src
zfsroot/var230K   218G19K  /var
zfsroot/var/audit   19K   218G19K  /var/audit
zfsroot/var/crash   19K   218G19K  /var/crash
zfsroot/var/log135K   218G   135K  /var/log
zfsroot/var/mail19K   218G19K  /var/mail
zfsroot/var/tmp 19K   218G19K  /var/tmp

Live maintenance is also a piece of cake with this.

If SSDs are used in a server, then I add a bit of cache. But as you can see,
the root stuff, including /usr and the like, is only 2.5GB. And the most used
part will be in the ZFS ARC, certainly if you did not save costs on RAM.

--WjW




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inherent insecurity of OSD daemons when using only a "public network"

2017-01-26 Thread Willem Jan Withagen
On 13-1-2017 12:45, Willem Jan Withagen wrote:
> On 13-1-2017 09:07, Christian Balzer wrote:
>>
>> Hello,
>>
>> Something I came across a while agao, but the recent discussion here
>> jolted my memory.
>>
>> If you have a cluster configured with just a "public network" and that
>> network being in RFC space like 10.0.0.0/8, you'd think you'd be "safe",
>> wouldn't you?
>>
>> Alas you're not:
>> ---
>> root@ceph-01:~# netstat -atn |grep LIST |grep 68
>> tcp0  0 0.0.0.0:68130.0.0.0:*   LISTEN   
>>   
>> tcp0  0 0.0.0.0:68140.0.0.0:*   LISTEN   
>>   
>> tcp0  0 10.0.0.11:6815  0.0.0.0:*   LISTEN   
>>   
>> tcp0  0 10.0.0.11:6800  0.0.0.0:*   LISTEN   
>>   
>> tcp0  0 0.0.0.0:68010.0.0.0:*   LISTEN   
>>   
>> tcp0  0 0.0.0.0:68020.0.0.0:*   LISTEN   
>>   
>> etc..
>> ---
>>
>> Something that people most certainly would NOT expect to be the default
>> behavior.
>>
>> Solution, define a complete redundant "cluster network" that's identical
>> to the public one and voila:
>> ---
>> root@ceph-02:~# netstat -atn |grep LIST |grep 68
>> tcp0  0 10.0.0.12:6816  0.0.0.0:*   LISTEN   
>>   
>> tcp0  0 10.0.0.12:6817  0.0.0.0:*   LISTEN   
>>   
>> tcp0  0 10.0.0.12:6818  0.0.0.0:*   LISTEN   
>>   
>> etc.
>> ---
>>
>> I'd call that a security bug, simply because any other daemon on the
>> planet will bloody bind to the IP it's been told to in its respective
>> configuration.
> 
> I do agree that this would not be the expected result if one specifies
> specific addresses. But it could be that this is how is was designed.
> 
> I have been hacking a bit in the networking code, and my more verbose
> code (HEAD) tells me:
> 1: starting osd.0 at - osd_data td/ceph-helpers/0 td/ceph-helpers/0/journal
> 1: 2017-01-13 12:24:02.045275 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6800/0
> 1: 2017-01-13 12:24:02.045429 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6800/0
> 1: 2017-01-13 12:24:02.045603 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6801/0
> 1: 2017-01-13 12:24:02.045669 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6800/0
> 1: 2017-01-13 12:24:02.045715 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6801/0
> 1: 2017-01-13 12:24:02.045758 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6802/0
> 1: 2017-01-13 12:24:02.045810 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6800/0
> 1: 2017-01-13 12:24:02.045857 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6801/0
> 1: 2017-01-13 12:24:02.045903 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6802/0
> 1: 2017-01-13 12:24:02.045997 b7dc000 -1  Processor -- bind:119 trying
> to bind to 0.0.0.0:6803/0
> 
> So binding factually occurs on 0.0.0.0.
> 
> Here in sequence are bound:
>   Messenger *ms_public = Messenger::create(g_ceph_context,
>   Messenger *ms_cluster = Messenger::create(g_ceph_context,
>   Messenger *ms_hbclient = Messenger::create(g_ceph_context,
>   Messenger *ms_hb_back_server = Messenger::create(g_ceph_context,
>   Messenger *ms_hb_front_server = Messenger::create(g_ceph_context,
>   Messenger *ms_objecter = Messenger::create(g_ceph_context,
> 
> But a specific address indication is not passed.
> 
> I have asked on the dev-list if this is the desired behaviour.
> And if not, I'll see if I can come up with a fix.

A fix for this has been merged into the HEAD code
https://github.com/ceph/ceph/pull/12929

If you define public_network but do not define cluster_network, then the
public network is used for the cluster network as well.
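
For illustration, a minimal ceph.conf sketch (the subnet is just an example
value, pick your own):

[global]
    public_network  = 10.0.0.0/24
    # cluster_network may now be left out; with the fix above the
    # public_network value is then used for the cluster side as well.
    # cluster_network = 10.0.0.0/24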

Not sure if this will get back-ported to earlier releases.

--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inherent insecurity of OSD daemons when using only a "public network"

2017-01-13 Thread Willem Jan Withagen
On 13-1-2017 09:07, Christian Balzer wrote:
> 
> Hello,
> 
> Something I came across a while agao, but the recent discussion here
> jolted my memory.
> 
> If you have a cluster configured with just a "public network" and that
> network being in RFC space like 10.0.0.0/8, you'd think you'd be "safe",
> wouldn't you?
> 
> Alas you're not:
> ---
> root@ceph-01:~# netstat -atn |grep LIST |grep 68
> tcp0  0 0.0.0.0:68130.0.0.0:*   LISTEN
>  
> tcp0  0 0.0.0.0:68140.0.0.0:*   LISTEN
>  
> tcp0  0 10.0.0.11:6815  0.0.0.0:*   LISTEN
>  
> tcp0  0 10.0.0.11:6800  0.0.0.0:*   LISTEN
>  
> tcp0  0 0.0.0.0:68010.0.0.0:*   LISTEN
>  
> tcp0  0 0.0.0.0:68020.0.0.0:*   LISTEN
>  
> etc..
> ---
> 
> Something that people most certainly would NOT expect to be the default
> behavior.
> 
> Solution, define a complete redundant "cluster network" that's identical
> to the public one and voila:
> ---
> root@ceph-02:~# netstat -atn |grep LIST |grep 68
> tcp0  0 10.0.0.12:6816  0.0.0.0:*   LISTEN
>  
> tcp0  0 10.0.0.12:6817  0.0.0.0:*   LISTEN
>  
> tcp0  0 10.0.0.12:6818  0.0.0.0:*   LISTEN
>  
> etc.
> ---
> 
> I'd call that a security bug, simply because any other daemon on the
> planet will bloody bind to the IP it's been told to in its respective
> configuration.

I do agree that this would not be the expected result if one specifies
specific addresses. But it could be that this is how it was designed.

I have been hacking a bit in the networking code, and my more verbose
code (HEAD) tells me:
1: starting osd.0 at - osd_data td/ceph-helpers/0 td/ceph-helpers/0/journal
1: 2017-01-13 12:24:02.045275 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6800/0
1: 2017-01-13 12:24:02.045429 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6800/0
1: 2017-01-13 12:24:02.045603 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6801/0
1: 2017-01-13 12:24:02.045669 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6800/0
1: 2017-01-13 12:24:02.045715 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6801/0
1: 2017-01-13 12:24:02.045758 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6802/0
1: 2017-01-13 12:24:02.045810 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6800/0
1: 2017-01-13 12:24:02.045857 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6801/0
1: 2017-01-13 12:24:02.045903 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6802/0
1: 2017-01-13 12:24:02.045997 b7dc000 -1  Processor -- bind:119 trying
to bind to 0.0.0.0:6803/0

So the binding does in fact occur on 0.0.0.0.

Here in sequence are bound:
  Messenger *ms_public = Messenger::create(g_ceph_context,
  Messenger *ms_cluster = Messenger::create(g_ceph_context,
  Messenger *ms_hbclient = Messenger::create(g_ceph_context,
  Messenger *ms_hb_back_server = Messenger::create(g_ceph_context,
  Messenger *ms_hb_front_server = Messenger::create(g_ceph_context,
  Messenger *ms_objecter = Messenger::create(g_ceph_context,

But a specific address indication is not passed.

I have asked on the dev-list if this is the desired behaviour.
And if not, I'll see if I can come up with a fix.

--WjW
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-11 Thread Willem Jan Withagen
On 11-1-2017 08:06, Adrian Saul wrote:
> 
> I would concur having spent a lot of time on ZFS on Solaris.
> 
> ZIL will reduce the fragmentation problem a lot (because it is not
> doing intent logging into the filesystem itself which fragments the
> block allocations) and write response will be a lot better.  I would
> use different devices for L2ARC and ZIL - ZIL needs to be small and
> fast for writes (and mirrored - we have used some HGST 16G devices
> which are designed as ZILs - pricy but highly recommend) - L2ARC just
> needs to be faster for reads than your data disks, most SSDs would be
> fine for this.

I've been using ZFS on FreeBSD ever since 2006, and I really like it.
Other than that it does not scale horizontally.

Ceph does a lot of sync()-type calls.
If you do not have a ZIL on SSDs, then ZFS creates a ZIL on HDD for the
sync() writes.
Most of the documentation talks about using that to reliably speed
up NFS, but it actually applies to ANY sync() operation.

> A 14 disk RAIDZ2 is also going to be very poor for writes especially
> with SATA - you are effectively only getting one disk worth of IOPS
> for write as each write needs to hit all disks.  Without a ZIL you
> are also losing out on write IOPS for ZIL and metadata operations.

I would definitely not have used RAIDZ2 if speed is of the utmost
importance. It has its advantages, but now you are using both ZFS's
redundancy AND the redundancy that is in Ceph.
So 2 extra HDDs in ZFS, and then on top of that the Ceph redundancy.

I haven't tried a large cluster yet, but if money allows it, my choice
would be 2-disk mirrors per OSD in a vdev pool, and to use that with a ZIL
on SSD. This gives you 2x the write-speed IOPS of the disks.
Using the RAID types does not give you much extra speed even when there
are more spindles.

One of the things that would be tempting is to have only 1 disk in
a vdev, and let Ceph do the rest. The problem is that you will need to
ZFS-scrub more often and repair manually, because errors will be
detected but cannot be repaired.

We have not even discussed compression in ZFS, because that again is a
large way of getting more speed out of the system...

There are also some questions that I'm wondering about:
 - L2ARC uses (lots of) core memory; so do the OSDs, and then there is
the buffer. All of these interact and compete for free RAM.
   What mix is sensible and gets the most out of the memory you have?
 - If you have a fast ZIL, would you still need a journal in Ceph?

Just my 2cts,
--WjW


>> -Original Message- From: ceph-users
>> [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Patrick
>> Donnelly Sent: Wednesday, 11 January 2017 5:24 PM To: Kevin
>> Olbrich Cc: Ceph Users Subject: Re: [ceph-users] Review of Ceph on
>> ZFS - or how not to deploy Ceph for RBD + OpenStack
>> 
>> Hello Kevin,
>> 
>> On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich  wrote:
>>> 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700
>>> NVMe journal,
>> 
>> Is the "journal" used as a ZIL?
>> 
>>> We experienced a lot of io blocks (X requests blocked > 32 sec)
>>> when a lot of data is changed in cloned RBDs (disk imported via
>>> OpenStack Glance, cloned during instance creation by Cinder). If
>>> the disk was cloned some months ago and large software updates
>>> are applied (a lot of small files) combined with a lot of syncs,
>>> we often had a node hit suicide timeout. Most likely this is a
>>> problem with op thread count, as it is easy to block threads with
>>> RAIDZ2 (RAID6) if many small operations are written to disk
>>> (again, COW is not optimal here). When recovery took place
>>> (0.020% degraded) the cluster performance was very bad - remote
>>> service VMs (Windows) were unusable. Recovery itself was using 70
>>> - 200 mb/s which was okay.
>> 
>> I would think having an SSD ZIL here would make a very large
>> difference. Probably a ZIL may have a much larger performance
>> impact than an L2ARC device. [You may even partition it and have
>> both but I'm not sure if that's normally recommended.]
>> 
>> Thanks for your writeup!
>> 
>> -- Patrick Donnelly 
>> ___ ceph-users mailing
>> list ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> Confidentiality: This email and any attachments are confidential and
> may be subject to copyright, legal or some other professional
> privilege. They are intended solely for the attention and use of the
> named addressee(s). They may only be copied, distributed or disclosed
> with the consent of the copyright owner. If you have received this
> email by mistake or by breach of the confidentiality clause, please
> notify the sender immediately by return email and delete or destroy
> all copies of the email. Any confidentiality, privilege or copyright
> is not waived or lost because this email has been sent to you by
> mistake. ___ ceph-users
> 

Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-10 Thread Willem Jan Withagen
On 10-1-2017 20:35, Lionel Bouton wrote:
> Hi,

I usually don't top-post, but this time it is just to agree wholeheartedly
with what you wrote. And you again have more arguments as to why.

Using SSDs that don't work right is a certain recipe for losing data.

--WjW

> Le 10/01/2017 à 19:32, Brian Andrus a écrit :
>> [...]
>>
>>
>> I think the main point I'm trying to address is - as long as the
>> backing OSD isn't egregiously handling large amounts of writes and it
>> has a good journal in front of it (that properly handles O_DSYNC [not
>> D_SYNC as Sebastien's article states]), it is unlikely inconsistencies
>> will occur upon a crash and subsequent restart.
> 
> I don't see how you can guess if it is "unlikely". If you need SSDs you
> are probably handling relatively large amounts of accesses (so large
> amounts of writes aren't unlikely) or you would have used cheap 7200rpm
> or even slower drives.
> 
> Remember that in the default configuration, if you have any 3 OSDs
> failing at the same time, you have chances of losing data. For <30 OSDs
> and size=3 this is highly probable as there are only a few thousands
> combinations of 3 OSDs possible (and you usually have typically a
> thousand or 2 of pgs picking OSDs in a more or less random pattern).
> 
> With SSDs not handling write barriers properly I wouldn't bet on
> recovering the filesystems of all OSDs properly given a cluster-wide
> power loss shutting down all the SSDs at the same time... In fact as the
> hardware will lie about the stored data, the filesystem might not even
> detect the crash properly and might apply its own journal on outdated
> data leading to unexpected results.
> So losing data is a possibility and testing for it is almost impossible
> (you'll have to reproduce all the different access patterns your Ceph
> cluster could experience at the time of a power loss and trigger the
> power losses in each case).
> 
>>
>> Therefore - while not ideal to rely on journals to maintain consistency,
> 
> Ceph journals aren't designed for maintaining the filestore consistency.
> They *might* restrict the access patterns to the filesystems in such a
> way that running fsck on them after a "let's throw away committed data"
> crash might have better chances of restoring enough data but if it's the
> case it's only an happy coincidence (and you will have to run these
> fscks *manually* as the filesystem can't detect inconsistencies by itself).
> 
>> that is what they are there for.
> 
> No. They are here for Ceph internal consistency, not the filesystem
> backing the filestore consistency. Ceph relies both on journals and
> filesystems able to maintain internal consistency and supporting syncfs
> to maintain consistency, if the journal or the filesystem fails the OSD
> is damaged. If 3 OSDs are damaged at the same time on a size=3 pool you
> enter "probable data loss" territory.
> 
>> There is a situation where "consumer-grade" SSDs could be used as
>> OSDs. While not ideal, it can and has been done before, and may be
>> preferable to tossing out $500k of SSDs (Seen it firsthand!)
> 
> For these I'd like to know :
> - which SSD models were used ?
> - how long did the SSDs survive (some consumer SSDs not only lie to the
> system about write completions but they usually don't handle large
> amounts of write nearly as well as DC models) ?
> - how many cluster-wide power losses did the cluster survive ?
> - what were the access patterns on the cluster during the power losses ?
> 
> If for a model not guaranteed for sync writes there hasn't been dozens
> of power losses on clusters under large loads without any problem
> detected in the week following (thing deep-scrub), using them is playing
> Russian roulette with your data.
> 
> AFAIK there have only been reports of data losses and/or heavy
> maintenance later when people tried to use consumer SSDs (admittedly
> mainly for journals). I've yet to spot long-running robust clusters
> built with consumer SSDs.
> 
> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-09 Thread Willem Jan Withagen
On 9-1-2017 23:58, Brian Andrus wrote:
> Sorry for spam... I meant D_SYNC.

That term does not turn up anything in Google...
So I would expect it has to be O_DSYNC.
(https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/)

Now you tell me there are SSDs that do take correct action with
O_SYNC but not with O_DSYNC... That makes no sense to me. That is
typically an OS-level trade-off: speed versus a slightly less
consistent FS.

Either a device actually writes its data persistently (either into silicon
cells, or by keeping it in RAM with a supercapacitor), or it does not.
Anything else I cannot think of. Maybe my EE background is sort of in
the way here. And I know that it is rather hard to write correct SSD
firmware; I have seen lots of firmware upgrades that actually fix serious
corner cases.

Now the second thing is how hard a drive lies when it is told that
the requested write must be synchronised, and that OK is only to be
returned when the data is in stable storage and cannot be lost.
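
If you want to see what a drive actually does with such writes, a minimal
sketch along the lines of Sebastien's test, in Python 3 (the path is just an
example; point it at a file on the SSD under test, and os.O_DSYNC must be
available on your platform):

import os, time

PATH = "/mnt/ssd-under-test/dsync-test"   # example path on the SSD being tested
COUNT, BS = 1000, 4096
buf = b"\0" * BS

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
t0 = time.time()
for _ in range(COUNT):
    os.write(fd, buf)        # on return the data must be on stable storage
os.close(fd)
dt = time.time() - t0
print("%d x %dB O_DSYNC writes: %.0f IOPS, %.2f MB/s"
      % (COUNT, BS, COUNT / dt, COUNT * BS / dt / 1e6))

A drive that lies will still show great numbers here; the only real test for
the persistence part is pulling the power mid-run, as discussed.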

If there is a possibility that a sync write to a drive is not
persistent, then that is a serious breach of the sync-write contract.
There will always be situations in which these drives can lose data.
And if the data is no longer in the journal, because the writing process
thought it was on stable storage and deleted it from the
journal, then that data is permanently lost.

Now you have a second chance (even a third) with Ceph, because data is
stored multiple times. And you can go to another OSD and try to get it back.

--WjW

> 
> On Mon, Jan 9, 2017 at 2:56 PM, Brian Andrus <brian.and...@dreamhost.com
> <mailto:brian.and...@dreamhost.com>> wrote:
> 
> Hi Willem, the SSDs are probably fine for backing OSDs, it's the
> O_DSYNC writes they tend to lie about.
> 
> They may have a failure rate higher than enterprise-grade SSDs, but
> are otherwise suitable for use as OSDs if journals are placed elsewhere.
> 
> On Mon, Jan 9, 2017 at 2:39 PM, Willem Jan Withagen <w...@digiware.nl
> <mailto:w...@digiware.nl>> wrote:
> 
> On 9-1-2017 18:46, Oliver Humpage wrote:
> >
> >> Why would you still be using journals when running fully OSDs on
> >> SSDs?
> >
> > In our case, we use cheaper large SSDs for the data (Samsung 850 Pro
> > 2TB), whose performance is excellent in the cluster, but as has been
> > pointed out in this thread can lose data if power is suddenly
> > removed.
> >
> > We therefore put journals onto SM863 SSDs (1 journal SSD per 3 OSD
> > SSDs), which are enterprise quality and have power outage 
> protection.
> > This seems to balance speed, capacity, reliability and budget fairly
> > well.
> 
> This would make me feel very uncomfortable.
> 
> So you have a reliable journal, so upto there thing do work:
>   Once in the journal you data is safe.
> 
> But then you async transfer the data to disk. And that is an SSD
> that
> lies to you? It will tell you that the data is written. But if
> you pull
> the power, then it turns out that the data is not really stored.
> 
> And then the only way to get the data consistent again, is to
> (deep)scrub.
> 
> Not a very appealing lookout??
> 
> --WjW
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 
> 
> 
> -- 
> Brian Andrus
> Cloud Systems Engineer
> DreamHost, LLC
> 
> 
> 
> 
> -- 
> Brian Andrus
> Cloud Systems Engineer
> DreamHost, LLC

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-09 Thread Willem Jan Withagen
On 9-1-2017 18:46, Oliver Humpage wrote:
> 
>> Why would you still be using journals when running fully OSDs on
>> SSDs?
> 
> In our case, we use cheaper large SSDs for the data (Samsung 850 Pro
> 2TB), whose performance is excellent in the cluster, but as has been
> pointed out in this thread can lose data if power is suddenly
> removed.
> 
> We therefore put journals onto SM863 SSDs (1 journal SSD per 3 OSD
> SSDs), which are enterprise quality and have power outage protection.
> This seems to balance speed, capacity, reliability and budget fairly
> well.

This would make me feel very uncomfortable.

So you have a reliable journal, so up to there things do work:
  Once in the journal your data is safe.

But then you asynchronously transfer the data to disk. And that is to an SSD
that lies to you? It will tell you that the data is written. But if you pull
the power, then it turns out that the data is not really stored.

And then the only way to get the data consistent again is to (deep-)scrub.

Not a very appealing outlook, is it?

--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-08 Thread Willem Jan Withagen
On 7-1-2017 15:03, Lionel Bouton wrote:
> Le 07/01/2017 à 14:11, kevin parrikar a écrit :
>> Thanks for your valuable input.
>> We were using these SSD in our NAS box(synology)  and it was giving
>> 13k iops for our fileserver in raid1.We had a few spare disks which we
>> added to our ceph nodes hoping that it will give good performance same
>> as that of NAS box.(i am not comparing NAS with ceph ,just the reason
>> why we decided to use these SSD)
>>
>> We dont have S3520 or S3610 at the moment but can order one of these
>> to see how it performs in ceph .We have 4xS3500  80Gb handy.
>> If i create a 2 node cluster with 2xS3500 each and with replica of
>> 2,do you think it can deliver 24MB/s of 4k writes .
> 
> Probably not. See
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> According to the page above the DC S3500 reaches 39MB/s. Its capacity
> isn't specified, yours are 80GB only which is the lowest capacity I'm
> aware of and for all DC models I know of the speed goes down with the
> capacity so you probably will get lower than that.
> If you put both data and journal on the same device you cut your
> bandwidth in half : so this would give you an average <20MB/s per OSD
> (with occasional peaks above that if you don't have a sustained 20MB/s).
> With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a
> single stream of data you will only get <20MB/s though (you won't
> benefit from parallel writes to the 4 OSDs and will only write on 2 at a
> time).

I'm new to this part of tuning Ceph, but I do have an architectural
question:

Why would you still be using journals when running OSDs fully on SSDs?

When using a journal, the data is first written to the journal, and then
that same data is (later on) written again to disk.
This is under the assumption that the time to write to the journal is only a
fraction of the time it costs to write to disk. And since writing data
to stable storage is on the critical path, the journal brings an advantage.

Now when the disk is already an SSD, I see very little difference in
writing the data directly to disk and forgoing the journal.
I would imagine that not using journals would cut the write time in half
because the data is only written once. There is no loss of bandwidth on
the SSD, and internally the SSD does not have to manage double the
amount of erase cycles in garbage collection once the SSD comes close to
being fully used.

The only thing I can imagine that makes a difference is that journal
writing is slightly faster than writing data into the FS that is used
for the disk. But that should not be such a major extra cost that it
warrants all the other disadvantages.
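
As a rough illustration of the double-write effect in Lionel's numbers above,
a small sketch in Python (the 39 MB/s figure is the S3500 journal speed
quoted earlier in the thread, not something I measured):

ssd_write_mb_s = 39.0     # sustained O_DSYNC write speed of one SSD
osds           = 4
replicas       = 2

per_osd   = ssd_write_mb_s / 2           # journal + data on the same device -> halved
aggregate = osds * per_osd / replicas    # every client byte lands on 'replicas' OSDs
print("per OSD:   ~%.0f MB/s" % per_osd)     # ~20 MB/s
print("aggregate: ~%.0f MB/s" % aggregate)   # ~39 MB/s; a single stream still tops out at ~20 MB/s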

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com