Re: [ceph-users] SSD Hardware recommendation

2015-03-31 Thread Adam Tygart
Speaking of SSD IOPs. Running the same tests on my SSDs (LiteOn ECT-480N9S 480GB SSDs): The lines at the bottom are a single 6TB spinning disk for comparison's sake. http://imgur.com/a/fD0Mh Based on these numbers, there is a minimum latency per operation, but multiple operations can be performed
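
A run along those lines might look like the following, assuming fio was the benchmark tool (the post doesn't say) and with /dev/sdX as a placeholder for the device under test (the run is destructive to data on it):

    fio --name=ssd-iops --filename=/dev/sdX --direct=1 --ioengine=libaio \
        --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based \
        --group_reporting    # 4k random writes; reports IOPS and per-op latency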

Re: [ceph-users] mds crashing

2015-04-15 Thread Adam Tygart
We are using 3.18.6-gentoo. Based on that, I was hoping that the kernel bug referred to in the bug report would have been fixed. -- Adam On Wed, Apr 15, 2015 at 8:02 PM, Yan, Zheng wrote: > On Thu, Apr 16, 2015 at 5:29 AM, Kyle Hutson wrote: >> Thank you, John! >> >> That was exactly the bug we

Re: [ceph-users] mds crashing

2015-04-15 Thread Adam Tygart
What is significantly smaller? We have 67 requests in the 16,400,000 range and 250 in the 18,900,000 range. Thanks, Adam On Wed, Apr 15, 2015 at 8:38 PM, Yan, Zheng wrote: > On Thu, Apr 16, 2015 at 9:07 AM, Adam Tygart wrote: >> We are using 3.18.6-gentoo. Based on that, I was hoping

Re: [ceph-users] mds crashing

2015-04-16 Thread Adam Tygart
Adam On Thu, Apr 16, 2015 at 1:35 AM, Yan, Zheng wrote: > On Thu, Apr 16, 2015 at 10:44 AM, Adam Tygart wrote: >> We did that just after Kyle responded to John Spray above. I am >> rebuilding the kernel now to include dynamic printk support. >> > > Maybe the first cra

Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool

2015-04-25 Thread Adam Tygart
We're currently putting data into our cephfs pool (cachepool in front of it as a caching tier), but the metadata pool contains ~50MB of data for 36 million files. If that were an accurate estimation, we'd have a metadata pool closer to ~140GB. Here is a ceph df detail: http://people.beocat.cis.ksu
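
The figures above come from the standard pool accounting commands, which are worth re-running when comparing metadata growth against file counts:

    ceph df detail    # per-pool USED and OBJECTS, as in the linked output
    rados df          # per-pool object counts and raw space used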

Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool

2015-04-25 Thread Adam Tygart
s each. How > do you have things configured, exactly? > > On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart wrote: >> >> We're currently putting data into our cephfs pool (cachepool in front >> of it as a caching tier), but the metadata pool contains ~50MB of data >> f

Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool

2015-04-25 Thread Adam Tygart
nt to think the pg statistics reporting is going > wrong somehow. > ...I bet the leveldb/omap stuff isn't being included in the of statistics. > That could be why and would make sense with what you've got here. :) > -Greg > On Sat, Apr 25, 2015 at 10:32 AM Adam Tygart wrote: >&g

[ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
Hello all, The ceph-mds servers in our cluster are performing a constant boot->replay->crash in our systems. I have enabled debug logging for the mds for a restart cycle on one of the nodes[1]. Kernel debug from cephfs client during reconnection attempts: [732586.352173] ceph: mdsc delayed_work
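
For reference, one way to capture a restart cycle at that log level is to set the debug options in ceph.conf on the MDS host and bounce the daemon (a sketch; the restart command depends on the distro and init system in use):

    [mds]
        debug mds = 20
        debug ms = 1
    # then restart, e.g. systemctl restart ceph-mds@<id> or /etc/init.d/ceph restart mds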

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
then you can see where it's at when it crashes. > > --Lincoln > > On May 22, 2015, at 9:33 AM, Adam Tygart wrote: > >> Hello all, >> >> The ceph-mds servers in our cluster are performing a constant >> boot->replay->crash in our systems. >> >

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
r multiple active MDS? > > --Lincoln > > On May 22, 2015, at 10:10 AM, Adam Tygart wrote: > >> Thanks for the quick response. >> >> I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this one: >> https://drive.google.com/f

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
On Fri, May 22, 2015 at 11:47 AM, John Spray wrote: > > > On 22/05/2015 15:33, Adam Tygart wrote: >> >> Hello all, >> >> The ceph-mds servers in our cluster are performing a constant >> boot->replay->crash in our systems. >> >> I have enabl

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
sed them and flushed them out of the caches. I also would have thought that a close file/flush in rsync (which I am sure it does, after finishing writing a file) would have let them close in the cephfs session. -- Adam On Fri, May 22, 2015 at 2:06 PM, Gregory Farnum wrote: > On Fri, May 22, 2

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
. -- Adam On Fri, May 22, 2015 at 2:39 PM, Gregory Farnum wrote: > On Fri, May 22, 2015 at 12:34 PM, Adam Tygart wrote: >> I believe I grabbed all of theses files: >> >> for x in $(rados -p metadata ls | grep -E '^200\.'); do rados -p >> metadat
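
The loop is cut off by the archive; a hedged reconstruction of what it presumably did (pull every journal object, the 200.* range, out of the metadata pool into local files) is:

    for x in $(rados -p metadata ls | grep -E '^200\.'); do
        rados -p metadata get "$x" "$x"    # save each object under its own name
    done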

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
Alright, bumping that up 10 worked. The MDS server came up and "recovered". Took about 1 minute. Thanks again, guys. -- Adam On Fri, May 22, 2015 at 2:50 PM, Gregory Farnum wrote: > On Fri, May 22, 2015 at 12:45 PM, Adam Tygart wrote: >> Fair enough. Anyway, is it safe

[ceph-users] xattrs vs omap

2015-07-01 Thread Adam Tygart
Hello all, I've got a coworker who put "filestore_xattr_use_omap = true" in the ceph.conf when we first started building the cluster. Now he can't remember why. He thinks it may be a holdover from our first Ceph cluster (running dumpling on ext4, iirc). In the newly built cluster, we are using XF
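
For reference, the setting in question as it appears in ceph.conf; on XFS it is generally unnecessary, since XFS xattrs are large enough for filestore's metadata:

    [osd]
        filestore xattr use omap = true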

Re: [ceph-users] Difference between CephFS and RBD

2015-07-06 Thread Adam Tygart
CephFS doesn't use RBD, it uses the Rados protocol (that RBD uses behind the scenes). You can set striping parameters for files in CephFS, though, just as you can for RBD. The real problem here, as I understand it, is simultaneous access to the same file. Write-locks happen at the file level, so m
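
Striping in CephFS is exposed through virtual xattrs, set per file (or per directory for newly created files); a sketch with placeholder paths and values (note that a file's layout can only be changed while it is still empty):

    getfattr -n ceph.file.layout /mnt/cephfs/somefile              # show current layout
    setfattr -n ceph.file.layout.stripe_unit -v 1048576 /mnt/cephfs/empty-file
    setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/somedir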

[ceph-users] CephFS New EC Data Pool

2015-07-21 Thread Adam Tygart
Hello all, I'm trying to add a new data pool to CephFS, as we need some longer term archival storage. ceph mds add_data_pool archive Error EINVAL: can't use pool 'archive' as it's an erasure-code pool Here are the steps taken to create the pools for this new datapool: ceph osd pool create arccac
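
At the time, the usual way around that error was to put a replicated cache tier in front of the erasure-coded pool before adding it to CephFS; roughly, with pool names and PG counts as examples:

    ceph osd pool create arccache 1024 1024 replicated
    ceph osd tier add archive arccache
    ceph osd tier cache-mode arccache writeback
    ceph osd tier set-overlay archive arccache
    ceph mds add_data_pool archive    # retried once the tier/overlay is in place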

[ceph-users] CephFS "corruption" -- Nulled bytes

2015-09-25 Thread Adam Tygart
Hello all, I've run into some sort of bug with CephFS. Client reads of a particular file return nothing but 40KB of Null bytes. Doing a rados level get of the inode returns the whole file, correctly. Tested via Linux 4.1, 4.2 kernel clients, and the 0.94.3 fuse client. Attached is a dynamic prin
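
The rados-level comparison can be reproduced with something like the following; the pool name is an assumption, and CephFS object names are <inode-in-hex>.<chunk-number>:

    ino=$(stat -c %i /mnt/cephfs/path/to/file)   # inode number of the affected file
    obj=$(printf '%x.00000000' "$ino")           # first object of that file
    rados -p cephfs_data get "$obj" /tmp/chunk0
    md5sum /tmp/chunk0                           # compare against a client-side read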

Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-09-25 Thread Adam Tygart
part of infernalis. The > original reproducer involved truncating/overwriting files. In your example, > do you know if 'kstat' has been truncated/overwritten prior to generating > the md5sums? > > On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart wrote: >> >> Hello

Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-09-27 Thread Adam Tygart
I've done some digging into cp and mv's semantics (from coreutils). If the inode already exists, the file will get truncated, then data will get copied in. This is definitely within the scope of the bug above. -- Adam On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart wrote: > It may have b

Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-10-05 Thread Adam Tygart
y way to figure out what happened? -- Adam On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart wrote: > I've done some digging into cp and mv's semantics (from coreutils). If > the inode is existing, the file will get truncated, then data will get > copied in. This is definitely wit

Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-10-07 Thread Adam Tygart
ts 'parent', and doing the same on the parent directory's inode simply lists 'parent'. Thanks for your time. -- Adam On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil wrote: > On Mon, 5 Oct 2015, Adam Tygart wrote: >> Okay, this has happened several more times. Alway

Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-10-14 Thread Adam Tygart
Oct 8, 2015 at 11:11 AM, Lincoln Bryant wrote: > Hi Sage, > > Will this patch be in 0.94.4? We've got the same problem here. > > -Lincoln > >> On Oct 8, 2015, at 12:11 AM, Sage Weil wrote: >> >> On Wed, 7 Oct 2015, Adam Tygart wrote: >>> Does this

Re: [ceph-users] Erasure coded pools and 'feature set mismatch' issue

2015-11-08 Thread Adam Tygart
The problem is that "hammer" tunables (i.e. "optimal" in v0.94.x) are incompatible with the kernel interfaces before Linux 4.1 (namely due to straw2 buckets). To make use of the kernel interfaces in 3.13, I believe you'll need "firefly" tunables. -- Adam On Sun, Nov 8, 2015 at 11:48 PM, Bogdan SO
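
Switching back is a single command, with the caveat that changing tunables triggers data movement:

    ceph osd crush tunables firefly    # profile the 3.13 kernel client can understand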

Re: [ceph-users] hadoop on cephfs

2016-04-30 Thread Adam Tygart
Supposedly cephfs-hadoop worked and/or works on hadoop 2. I am in the process of getting it working with cdh5.7.0 (based on hadoop 2.6.0). I'm under the impression that it is/was working with 2.4.0 at some point in time. At this very moment, I can use all of the DFS tools built into hadoop to crea

[ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-01 Thread Adam Tygart
A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt At the moment, we're dead in the water. I would appreciate

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-01 Thread Adam Tygart
> 6: (()+0x7dc5) [0x7f35146ecdc5] > 7: (clone()+0x6d) [0x7f3512d77ced] > NOTE: a copy of the executable, or `objdump -rdS ` is needed to > interpret this. > 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In > function 'bool ceph::HeartbeatMap::_check(const ceph::hear

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-01 Thread Adam Tygart
ndon referred to. Which in CephFS can be fixed by either > having smaller folders or (if you're very nervy, and ready to turn on > something we think works but don't test enough) enabling directory > fragmentation. > -Greg > > On Wed, Jun 1, 2016 at 2:14 PM, Adam Tygart

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-02 Thread Adam Tygart
kR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool > Hopefully it works in your case too and you can the cluster back to a state > that you can make the CephFS directories smaller. > > - Brandon > > On Wed, Jun 1, 2016 at 4:22 PM, Gregory Farnum wrote: >> >> O

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-02 Thread Adam Tygart
I'm still exporting pgs out of some of the downed osds, but things are definitely looking promising. Marginally related to this thread, as these seem to be most of the hanging objects when exporting pgs, what are inodes in the 600 range used for within the metadata pool? I know the 200 range is us

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-03 Thread Adam Tygart
m). Could that cause this fragmentation that we think is the issue? -- Adam On Thu, Jun 2, 2016 at 10:32 PM, Adam Tygart wrote: > I'm still exporting pgs out of some of the downed osds, but things are > definitely looking promising. > > Marginally related to this thread, as thes

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-03 Thread Adam Tygart
With regards to this export/import process, I've been exporting a pg from an osd for more than 24 hours now. The entire OSD only has 8.6GB of data. 3GB of that is in omap. The export for this particular PG is only 108MB in size right now, after more than 24 hours. How is it possible that a fragment

Re: [ceph-users] Best upgrade strategy

2016-06-05 Thread Adam Tygart
If your monitor nodes are separate from the osd nodes, I'd get ceph upgraded to the latest point release of your current line (0.94.7). Upgrade monitors, then osds, then other dependent services (mds, rgw, qemu). Once everything is happy again, I'd run OS and ceph upgrades together, starting with m
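
A rough sketch of that order, with service names assuming systemd packaging (older hammer packages use /etc/init.d/ceph instead):

    ceph osd set noout                    # avoid rebalancing while daemons restart
    systemctl restart ceph-mon.target     # each monitor host, one at a time
    systemctl restart ceph-osd.target     # each osd host, one at a time, waiting for HEALTH_OK
    systemctl restart ceph-mds.target     # then mds, rgw, and client-side pieces
    ceph osd unset noout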

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-06 Thread Adam Tygart
Would it be beneficial for anyone to have an archive copy of an osd that took more than 4 days to export? All but an hour of that time was spent exporting 1 pg (that ended up being 197MB). I can even send along the extracted pg for analysis... -- Adam On Fri, Jun 3, 2016 at 2:39 PM, Adam Tygart

Re: [ceph-users] RDMA/Infiniband status

2016-06-09 Thread Adam Tygart
IPoIB is done with broadcast packets on the InfiniBand fabric. Most switches and opensm (by default) set up a broadcast group at the lowest IB speed (SDR), to support all possible IB connections. If you're using pure DDR, you may need to tune the broadcast group in your subnet manager to set the spe
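
With opensm that lives in the partitions file (the path varies by distro; on RHEL/CentOS it is /etc/rdma/partitions.conf). A hedged example, where rate=6 is the IB encoding for 20 Gb/s (4x DDR):

    Default=0x7fff, ipoib, rate=6, mtu=4 : ALL=full;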

Re: [ceph-users] RDMA/Infiniband status

2016-06-09 Thread Adam Tygart
I believe this is what you want: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_the_Subnet_Manager.html -- Adam On Thu, Jun 9, 2016 at 10:01 AM, Gandalf Corvotempesta wrote: > Il 09 giu 2016 15:41, "Adam Tygart"

Re: [ceph-users] CephFS Bug found with CentOS 7.2

2016-06-16 Thread Adam Tygart
This sounds an awful lot like a bug I've run into a few times (not often enough to get a good backtrace out of the kernel or mds) involving vim on a symlink to a file in another directory. It will occasionally corrupt the symlink in such a way that the symlink is unreadable. Filling dmesg with:

Re: [ceph-users] Bluestore RAM usage/utilization

2016-06-16 Thread Adam Tygart
According to Sage[1], Bluestore makes use of the pagecache. I don't believe read-ahead is a filesystem tunable in Linux; it is set on the block device itself, so read-ahead shouldn't be an issue. I'm not familiar enough with Bluestore to comment on the rest. [1] http://www.spinics.net/lists
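
Read-ahead can be checked and changed on the block device itself, e.g. (device name is a placeholder):

    cat /sys/block/sdX/queue/read_ahead_kb         # current read-ahead
    echo 4096 > /sys/block/sdX/queue/read_ahead_kb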

Re: [ceph-users] Issues with CephFS

2016-06-18 Thread Adam Tygart
Responses inline. On Sat, Jun 18, 2016 at 4:53 PM, ServerPoint wrote: > Hi, > > I am trying to setup a Ceph cluster and mount it as CephFS > > These are the steps that I followed : > - > ceph-deploy new mon > ceph-deploy install admin mon node2 nod
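
For context, the steps that usually follow the deploy commands to get a mountable CephFS are roughly (pool names, PG counts, and paths are examples):

    ceph osd pool create cephfs_data 128
    ceph osd pool create cephfs_metadata 128
    ceph fs new cephfs cephfs_metadata cephfs_data
    mount -t ceph <mon-host>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret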

Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-11 Thread Adam Tygart
Some people are doing hyperconverged ceph, colocating qemu virtualization with ceph-osds. It is relevant for a decent subset of people here. Therefore knowledge of the degree of performance degradation is useful. -- Adam On Thu, Jan 11, 2018 at 11:38 AM, wrote: > I don't understand how all of t

[ceph-users] Down monitors after adding mds node

2016-09-30 Thread Adam Tygart
Hello all, I've gotten myself into a bit of a bind. I was prepping to add a new mds node to my ceph cluster. e.g. ceph-deploy mds create mormo Unfortunately, it started the mds server before I was ready. My cluster was running 10.2.1, and the newly deployed mds is 10.2.3. This caused 3 of my 5 m

Re: [ceph-users] Down monitors after adding mds node

2016-09-30 Thread Adam Tygart
I could, I suppose, update the monmaps in the working monitors to remove the broken ones and then re-deploy the broken ones. The main concern I have is that if the mdsmap update isn't pending on the working ones, what else isn't in sync. Thoughts? -- Adam On Fri, Sep 30, 2016 at 11:0
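
Editing the survivors' monmap is the documented procedure for removing monitors from an unhealthy cluster; a sketch, with mon IDs as placeholders and the surviving mons stopped first:

    ceph-mon -i mon-a --extract-monmap /tmp/monmap
    monmaptool /tmp/monmap --rm mon-c
    monmaptool /tmp/monmap --rm mon-d
    ceph-mon -i mon-a --inject-monmap /tmp/monmap    # repeat the inject on each survivor, then start them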

[ceph-users] Down monitors after adding mds node

2016-09-30 Thread Adam Tygart
Hello all, Not sure if this went through before or not, as I can't check the mailing list archives. I've gotten myself into a bit of a bind. I was prepping to add a new mds node to my ceph cluster. e.g. ceph-deploy mds create mormo Unfortunately, it started the mds server before I was ready. My

Re: [ceph-users] Down monitors after adding mds node

2016-10-01 Thread Adam Tygart
e and standby servers while initializing the mons? I would hope that, now that all the versions are in sync, a bad standby_for_fscid would not be possible with new mds servers starting. -- Adam On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum wrote: > On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygar

Re: [ceph-users] Down monitors after adding mds node

2016-10-02 Thread Adam Tygart
Sent before I was ready, oops. How might I get the osdmap from a down cluster? -- Adam On Mon, Oct 3, 2016 at 12:29 AM, Adam Tygart wrote: > I put this in the #ceph-dev on Friday, > > (gdb) print info > $7 = (const MDSMap::mds_info_t &) @0x5fb1da68: { >
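
Two places an osdmap can usually be pulled from while the cluster is down, depending on tool versions (paths are placeholders):

    # from a stopped monitor's store:
    ceph-monstore-tool /var/lib/ceph/mon/ceph-<id> get osdmap -- --out /tmp/osdmap
    # or from a stopped osd:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<N> --op get-osdmap --file /tmp/osdmap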

Re: [ceph-users] MDS damaged

2018-07-12 Thread Adam Tygart
I've hit this today with an upgrade to 12.2.6 on my backup cluster. Unfortunately there were issues with the logs (in that the files weren't writable) until after the issue struck. 2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log [ERR] : 5.255 full-object read crc 0x4e97b4e != e

Re: [ceph-users] MDS damaged

2018-07-13 Thread Adam Tygart
Bluestore. On Fri, Jul 13, 2018, 05:56 Dan van der Ster wrote: > Hi Adam, > > Are your osds bluestore or filestore? > > -- dan > > > On Fri, Jul 13, 2018 at 7:38 AM Adam Tygart wrote: > > > > I've hit this today with an upgrade to 12.2.6 on my bac

Re: [ceph-users] MDS damaged

2018-07-15 Thread Adam Tygart
Check out the message titled "IMPORTANT: broken luminous 12.2.6 release in repo, do not upgrade" It sounds like 12.2.7 should come *soon* to fix this transparently. -- Adam On Sun, Jul 15, 2018 at 10:28 AM, Nicolas Huillard wrote: > Hi all, > > I have the same problem here: > * during the upgra

Re: [ceph-users] removing auids and auid-based cephx capabilities

2018-08-11 Thread Adam Tygart
I don't care what happens to most of these rados commands, and I've never used the auid "functionality", but I have found the rados purge command quite useful when testing different rados level applications. Run a rados-level application test. Whoops it didn't do what you wanted, purge and start o

Re: [ceph-users] OSD Segfaults after Bluestore conversion

2018-08-27 Thread Adam Tygart
This issue was related to using Jemalloc. Jemalloc is not as well tested with Bluestore and led to lots of segfaults. We moved back to the default of tcmalloc with Bluestore and these stopped. Check /etc/sysconfig/ceph under RHEL-based distros. -- Adam On Mon, Aug 27, 2018 at 9:51 PM Tyler Bisho
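
If jemalloc had been enabled through that file it would typically be an LD_PRELOAD line; a hedged example (exact contents vary by package version), with the preload commented out to fall back to tcmalloc:

    # /etc/sysconfig/ceph
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
    #LD_PRELOAD=/usr/lib64/libjemalloc.so.1    # remove/comment this to go back to tcmalloc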

[ceph-users] Cephfs mds cache tuning

2018-09-30 Thread Adam Tygart
Hello all, I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000 cephfs mounts (kernel client). We're currently only using 1 active mds. Performance is great about 80% of the time. MDS responses (per ceph daemonperf mds.$(hostname -s)) indicate 2k-9k requests per second, with a la
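
On 12.2.x the relevant knob is mds_cache_memory_limit, adjustable at runtime or in ceph.conf (the value below is only an example):

    ceph tell mds.* injectargs '--mds_cache_memory_limit 17179869184'    # ~16 GiB, runtime change
    # or persistently:
    #   [mds]
    #       mds cache memory limit = 17179869184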

Re: [ceph-users] Cephfs mds cache tuning

2018-10-01 Thread Adam Tygart
DS needing to catch up on log trimming (though I’m unclear why changing the > cache size would impact this). > > On Sun, Sep 30, 2018 at 9:02 PM Adam Tygart wrote: >> >> Hello all, >> >> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000

Re: [ceph-users] Cephfs mds cache tuning

2018-10-02 Thread Adam Tygart
g with ops, ops_in_flight, perf dump and objecter requests. Thanks for your time. -- Adam On Mon, Oct 1, 2018 at 10:36 PM Adam Tygart wrote: > > Okay, here's what I've got: https://www.paste.ie/view/abe8c712 > > Of note, I've changed things up a little bit for the moment. I've

[ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Hello all, I'm having some stability issues with my ceph cluster at the moment. Using CentOS 7, and Ceph 12.2.4. I have osds that are segfaulting regularly, roughly every minute or so, and it seems to be getting worse, now with cascading failures. Backtraces look like this: ceph version 12.2.4

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Well, the cascading crashes are getting worse. I'm routinely seeing 8-10 of my 518 osds crash. I cannot start 2 of them without triggering 14 or so of them to crash repeatedly for more than an hour. I've run another one of them with more logging, debug osd = 20; debug ms = 1 (definitely more than
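
For reference, the debug settings and the recovery flag mentioned in the follow-ups can be applied like so (osd id is a placeholder):

    ceph tell osd.<id> injectargs '--debug_osd 20 --debug_ms 1'
    ceph osd set norecover    # pause recovery traffic while the crashing pgs are investigated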

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
:11 PM, Josh Durgin wrote: >> >> On 04/05/2018 06:15 PM, Adam Tygart wrote: >>> >>> Well, the cascading crashes are getting worse. I'm routinely seeing >>> 8-10 of my 518 osds crash. I cannot start 2 of them without triggering >>> 14 or so of them

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-06 Thread Adam Tygart
ecially the > complete_read_op one from http://tracker.ceph.com/issues/21931 > > Josh > > > On 04/05/2018 08:25 PM, Adam Tygart wrote: >> >> Thank you! Setting norecover has seemed to work in terms of keeping >> the osds up. I am glad my logs were of use to trackin

Re: [ceph-users] Adding a new rack to crush map without pain?

2017-04-18 Thread Adam Tygart
Ceph has the ability to use a script to figure out where in the crushmap this disk should go (on osd start): http://docs.ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook -- Adam On Tue, Apr 18, 2017 at 7:53 AM, Matthew Vernon wrote: > On 17/04/17 21:16, Richard Hesse wrot
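
A minimal hook is just a script that prints the osd's crush location; a sketch, with the path and rack name as placeholders:

    # ceph.conf on the osd hosts
    [osd]
        crush location hook = /usr/local/bin/ceph-crush-location

    # /usr/local/bin/ceph-crush-location (executable)
    #!/bin/sh
    echo "host=$(hostname -s) rack=rack1 root=default"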

[ceph-users] Race Condition(?) in CephFS

2017-04-25 Thread Adam Tygart
I'm using CephFS, on CentOS 7. We're currently migrating away from using a catch-all cephx key to mount the filesystem (with the kernel module), to a much more restricted key. In my tests, I've come across an issue, extracting a tar archive with a mount using the restricted key routinely cannot cr
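
For context, the restricted key being tested looks roughly like this (client name, path, and pool are placeholders):

    ceph auth get-or-create client.restricted \
        mon 'allow r' \
        mds 'allow r, allow rw path=/some/subdir' \
        osd 'allow rw pool=cephfs_data'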

Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-17 Thread Adam Tygart
As I understand it: 4.2G is used by ceph (all replication, metadata, et al); it is a sum of all the space "used" on the osds. 958M is the actual space the data in cephfs is using (without replication). 3.8G means you have some sparse files in cephfs. 'ceph df detail' should return something close

Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-18 Thread Adam Tygart
It appears that with --apparent-size, du adds the "size" of the directories to the total as well. On most filesystems this is the block size, or the amount of metadata space the directory is using. On CephFS, this size is fabricated to be the sum of the sizes of all sub-files, i.e. a cheap/free 'du -sh $fo
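
Those fabricated directory sizes are exposed directly as virtual xattrs, which is what makes the cheap recursive "du" possible:

    getfattr -n ceph.dir.rbytes /mnt/cephfs/somedir    # recursive byte total (the reported dir size)
    getfattr -n ceph.dir.rfiles /mnt/cephfs/somedir    # recursive file count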

Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Adam Tygart
The docs are already split by version, although it doesn't help that the versions aren't linked in an obvious manner. http://docs.ceph.com/docs/master/rados/operations/cache-tiering/ http://docs.ceph.com/docs/hammer/rados/operations/cache-tiering/ Updating the documentation takes a lot of effort by all in

Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Adam Tygart
l you need people to moderate them, and then that takes time away from people that could either be developing the software or updating documentation. On Thu, Feb 25, 2016 at 11:24 PM, Nigel Williams wrote: > On Fri, Feb 26, 2016 at 4:09 PM, Adam Tygart wrote: >> The docs are already

Re: [ceph-users] Lost 1/40 OSDs at EC 4+1, now PGs are incomplete

2018-12-11 Thread Adam Tygart
AFAIR, there is a feature request in the works to allow rebuild with K chunks, but not allow normal read/write until min_size is met. Not that I think running with m=1 is a good idea. I'm not seeing the tracker issue for it at the moment, though. -- Adam On Tue, Dec 11, 2018 at 9:50 PM Ashley Merr

[ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
Hello all, I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 7.6. We're using cephfs and rbd. Last night, one of our two active/active mds servers went laggy and upon restart once it goes active it immediately goes laggy again. I've got a log available here (debug_mds 20, debug

Re: [ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
all running CentOS 7.6 and using the kernel cephfs mount). I hope there is enough logging from before to try to track this issue down. We are back up and running for the moment. -- Adam On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart wrote: > > Hello all, > > I've got a 31 mac

Re: [ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
t the directory before I re-enable their jobs? -- Adam On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart wrote: > > On a hunch, I shutdown the compute nodes for our HPC cluster, and 10 > minutes after that restarted the mds daemon. It replayed the journal, > evicted the dead compute nodes and

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
It worked for about a week, and then seems to have locked up again. Here is the back trace from the threads on the mds: http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt -- Adam On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng wrote: > > On Sun, Jan 13, 2019 at 1:43 PM Adam

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
ith your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart wrote: > > It worked for about a week, and then see

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the 27th of December. -- Adam On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart wrote: > > Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't > happen before then. > > -- > Adam > >

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
at 8:42 PM Adam Tygart wrote: > > Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the > 27th of December. > > -- > Adam > > On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart wrote: > > > > Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of

Re: [ceph-users] Limiting osd process memory use in nautilus.

2019-04-16 Thread Adam Tygart
As of 13.2.3, you should use 'osd_memory_target' instead of 'bluestore_cache_size' -- Adam On Tue, Apr 16, 2019 at 10:28 AM Jonathan Proulx wrote: > > Hi All, > > I have a a few servers that are a bit undersized on RAM for number of > osds they run. > > When we swithced to bluestore about 1yr ag
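
As it would be set in ceph.conf (the value is an example; the Nautilus default is 4 GiB per osd):

    [osd]
        osd memory target = 2147483648    # ~2 GiB per osd process for RAM-constrained hosts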

[ceph-users] MDS Crashing 14.2.1

2019-05-15 Thread Adam Tygart
Hello all, I've got a 30 node cluster serving up lots of CephFS data. We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier this week. We've been running 2 MDS daemons in an active-active setup. Tonight one of the metadata daemons crashed with the following several times: -

Re: [ceph-users] MDS Crashing 14.2.1

2019-05-16 Thread Adam Tygart
Hello all, The rank 0 mds is still asserting. Is this duplicate inode situation one that I should be considering using the cephfs-journal-tool to export, recover dentries and reset? Thanks, Adam On Thu, May 16, 2019 at 12:51 AM Adam Tygart wrote: > > Hello all, > > I've got
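
The sequence being weighed (from the disaster-recovery-experts doc referenced later in the thread) is roughly the following, with the filesystem name as a placeholder and an export taken first as a backup:

    cephfs-journal-tool --rank=<fsname>:0 journal export /root/mds0-journal-backup.bin
    cephfs-journal-tool --rank=<fsname>:0 event recover_dentries summary
    cephfs-journal-tool --rank=<fsname>:0 journal reset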

Re: [ceph-users] MDS Crashing 14.2.1

2019-05-16 Thread Adam Tygart
hu, May 16, 2019, 13:52 Adam Tygart wrote: Hello all, The rank 0 mds is still asserting. Is this duplicate inode situation one that I should be considering using the cephfs-journal-tool to export, recover dentries and reset? Thanks, Adam On Thu, May 16, 2019 at 12:5

Re: [ceph-users] [lists.ceph.com代发]Re: MDS Crashing 14.2.1

2019-05-17 Thread Adam Tygart
, May 17, 2019 at 2:30 AM wrote: > > Hi > Can you tell me the detailed recovery cmd? > > I just started learning cephfs, I would be grateful. > > > > From: Adam Tygart > To: Ceph Users > Date: 2019/05/17 09:04 > Subject: [lists.ceph

Re: [ceph-users] [lists.ceph.com代发]Re: MDS Crashing 14.2.1

2019-05-17 Thread Adam Tygart
y created fs entries would not pass MDS scrub due to linkage errors. May 17, 2019 3:40 PM, "Adam Tygart" wrote: > I followed the docs from here: > http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts >

[ceph-users] Ceph Balancer Limitations

2019-09-11 Thread Adam Tygart
Hello all, We're using Nautilus 14.2.2 (upgrading soon to 14.2.3) on 29 CentOS osd servers. We've got a large variation of disk sizes and host densities, such that the default crush mappings lead to an unbalanced data and pg distribution. We enabled the balancer manager module in pg upmap mode.

Re: [ceph-users] Ceph Balancer Limitations

2019-09-13 Thread Adam Tygart
Thanks, I moved back to crush-compat mapping, the pool that was at "90% full" is now under 76% full. Before doing that, I had the automatic balancer off, and ran 'ceph balancer optimize test'. It ran for 12 hours before I killed it. In upmap mode, it was "balanced" or at least as balanced as it c
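
For reference, the balancer commands involved (plan name is arbitrary):

    ceph balancer off
    ceph balancer mode crush-compat    # or: upmap
    ceph balancer eval                 # score the current distribution
    ceph balancer optimize myplan
    ceph balancer show myplan
    ceph balancer execute myplan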