Speaking of SSD IOPS: running the same tests on my SSDs (LiteOn
ECT-480N9S 480GB SSDs):
The lines at the bottom are a single 6TB spinning disk for comparison's sake.
http://imgur.com/a/fD0Mh
Based on these numbers, there is a minimum latency per operation, but
multiple operations can be performed
We are using 3.18.6-gentoo. Based on that, I was hoping that the
kernel bug referred to in the bug report would have been fixed.
--
Adam
On Wed, Apr 15, 2015 at 8:02 PM, Yan, Zheng wrote:
> On Thu, Apr 16, 2015 at 5:29 AM, Kyle Hutson wrote:
>> Thank you, John!
>>
>> That was exactly the bug we
What is significantly smaller? We have 67 requests in the 16,400,000
range and 250 in the 18,900,000 range.
Thanks,
Adam
On Wed, Apr 15, 2015 at 8:38 PM, Yan, Zheng wrote:
> On Thu, Apr 16, 2015 at 9:07 AM, Adam Tygart wrote:
>> We are using 3.18.6-gentoo. Based on that, I was hoping
Adam
On Thu, Apr 16, 2015 at 1:35 AM, Yan, Zheng wrote:
> On Thu, Apr 16, 2015 at 10:44 AM, Adam Tygart wrote:
>> We did that just after Kyle responded to John Spray above. I am
>> rebuilding the kernel now to include dynamic printk support.
>>
>
> Maybe the first cra
We're currently putting data into our cephfs pool (cachepool in front
of it as a caching tier), but the metadata pool contains ~50MB of data
for 36 million files. If that were an accurate estimation, we'd have a
metadata pool closer to ~140GB. Here is a ceph df detail:
http://people.beocat.cis.ksu
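The mismatch is easy to quantify. A back-of-envelope sketch (the ~4KB-per-file figure is my assumption, back-computed from the ~140GB estimate quoted above):

```shell
# Assumed ~4KB of metadata per file across 36 million files,
# vs. the ~50MB actually reported for the metadata pool.
files=36000000
echo "estimated: $(( files * 4096 / 1024 / 1024 / 1024 )) GiB"   # ~137 GiB
echo "observed:  $(( 50 * 1024 * 1024 / files )) bytes/file"     # ~1 byte/file
```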
s each. How
> do you have things configured, exactly?
>
> On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart wrote:
>>
>> We're currently putting data into our cephfs pool (cachepool in front
>> of it as a caching tier), but the metadata pool contains ~50MB of data
>> f
nt to think the pg statistics reporting is going
> wrong somehow.
> ...I bet the leveldb/omap stuff isn't being included in the df statistics.
> That could be why and would make sense with what you've got here. :)
> -Greg
> On Sat, Apr 25, 2015 at 10:32 AM Adam Tygart wrote:
>&g
Hello all,
The ceph-mds servers in our cluster are performing a constant
boot->replay->crash in our systems.
I have enable debug logging for the mds for a restart cycle on one of
the nodes[1].
Kernel debug from cephfs client during reconnection attempts:
[732586.352173] ceph: mdsc delayed_work
then you can see where it's at when it crashes.
>
> --Lincoln
>
> On May 22, 2015, at 9:33 AM, Adam Tygart wrote:
>
>> Hello all,
>>
>> The ceph-mds servers in our cluster are performing a constant
>> boot->replay->crash in our systems.
>>
>
r multiple active MDS?
>
> --Lincoln
>
> On May 22, 2015, at 10:10 AM, Adam Tygart wrote:
>
>> Thanks for the quick response.
>>
>> I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this one:
>> https://drive.google.com/f
On Fri, May 22, 2015 at 11:47 AM, John Spray wrote:
>
>
> On 22/05/2015 15:33, Adam Tygart wrote:
>>
>> Hello all,
>>
>> The ceph-mds servers in our cluster are performing a constant
>> boot->replay->crash in our systems.
>>
>> I have enabl
sed them and flushed them out of
the caches. I also would have thought that a close file/flush in rsync
(which I am sure it does, after finishing writing a file) would have
let them close in the cephfs session.
--
Adam
On Fri, May 22, 2015 at 2:06 PM, Gregory Farnum wrote:
> On Fri, May 22, 2
.
--
Adam
On Fri, May 22, 2015 at 2:39 PM, Gregory Farnum wrote:
> On Fri, May 22, 2015 at 12:34 PM, Adam Tygart wrote:
>> I believe I grabbed all of these files:
>>
>> for x in $(rados -p metadata ls | grep -E '^200\.'); do rados -p
>> metadat
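(The quoted loop is cut off above; a plausible reconstruction, where the `get` form and the output filename are my guesses based on the recovery context, would be:)

```shell
# Back up every MDS journal object in the 200.* range
# from the metadata pool to local files.
for x in $(rados -p metadata ls | grep -E '^200\.'); do
    rados -p metadata get "$x" "$x"
done
```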
Alright, bumping that up by 10 worked. The MDS server came up and
"recovered". It took about 1 minute.
Thanks again, guys.
--
Adam
On Fri, May 22, 2015 at 2:50 PM, Gregory Farnum wrote:
> On Fri, May 22, 2015 at 12:45 PM, Adam Tygart wrote:
>> Fair enough. Anyway, is it safe
Hello all,
I've got a coworker who put "filestore_xattr_use_omap = true" in the
ceph.conf when we first started building the cluster. Now he can't
remember why. He thinks it may be a holdover from our first Ceph
cluster (running dumpling on ext4, iirc).
In the newly built cluster, we are using XF
CephFS doesn't use RBD, it uses the Rados protocol (that RBD uses
behind the scenes). You can set striping parameters for files in
CephFS, though, just as you can for RBD.
The real problem here, as I understand it, is simultaneous access to
the same file. Write-locks happen at the file level, so m
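As mentioned, striping is settable per file (or per directory, as a default for new files) through CephFS's virtual xattrs. A sketch, where the mount point and values are illustrative only, and the file must still be empty when the layout is changed:

```shell
# On a CephFS mount: set a more aggressive stripe layout on a new file.
touch /mnt/cephfs/bigfile
setfattr -n ceph.file.layout.stripe_unit  -v 1048576 /mnt/cephfs/bigfile
setfattr -n ceph.file.layout.stripe_count -v 8       /mnt/cephfs/bigfile
getfattr -n ceph.file.layout /mnt/cephfs/bigfile
```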
Hello all,
I'm trying to add a new data pool to CephFS, as we need some longer
term archival storage.
ceph mds add_data_pool archive
Error EINVAL: can't use pool 'archive' as it's an erasure-code pool
Here are the steps taken to create the pools for this new datapool:
ceph osd pool create arccac
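For reference, the usual workaround at the time was to front the erasure-coded pool with a replicated cache tier and add that instead. A sketch, with pool names and pg counts purely illustrative:

```shell
# Replicated cache pool in front of the EC 'archive' pool.
ceph osd pool create archive-cache 1024 1024 replicated
ceph osd tier add archive archive-cache
ceph osd tier cache-mode archive-cache writeback
ceph osd tier set-overlay archive archive-cache
ceph mds add_data_pool archive-cache
```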
Hello all,
I've run into some sort of bug with CephFS. Client reads of a
particular file return nothing but 40KB of Null bytes. Doing a rados
level get of the inode returns the whole file, correctly.
Tested via Linux 4.1, 4.2 kernel clients, and the 0.94.3 fuse client.
Attached is a dynamic prin
part of infernalis. The
> original reproducer involved truncating/overwriting files. In your example,
> do you know if 'kstat' has been truncated/overwritten prior to generating
> the md5sums?
>
> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart wrote:
>>
>> Hello
I've done some digging into cp and mv's semantics (from coreutils). If
the destination inode already exists, the file is truncated and then
the data is copied in. This is definitely within the scope of the bug above.
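That difference between cp and mv is easy to confirm locally, outside CephFS (filenames are arbitrary):

```shell
set -e
cd "$(mktemp -d)"
echo old > dst
echo new > src
ino=$(stat -c %i dst)
cp src dst        # cp opens the existing dst with O_TRUNC: same inode, new data
[ "$(stat -c %i dst)" -eq "$ino" ]
[ "$(cat dst)" = "new" ]
echo newer > src2
mv src2 dst       # mv rename(2)s over dst: the inode changes
[ "$(stat -c %i dst)" -ne "$ino" ]
```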
--
Adam
On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart wrote:
> It may have b
y way to figure out what happened?
--
Adam
On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart wrote:
> I've done some digging into cp and mv's semantics (from coreutils). If
> the inode is existing, the file will get truncated, then data will get
> copied in. This is definitely wit
ts 'parent', and doing the same on the
parent directory's inode simply lists 'parent'.
Thanks for your time.
--
Adam
On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil wrote:
> On Mon, 5 Oct 2015, Adam Tygart wrote:
>> Okay, this has happened several more times. Alway
Oct 8, 2015 at 11:11 AM, Lincoln Bryant wrote:
> Hi Sage,
>
> Will this patch be in 0.94.4? We've got the same problem here.
>
> -Lincoln
>
>> On Oct 8, 2015, at 12:11 AM, Sage Weil wrote:
>>
>> On Wed, 7 Oct 2015, Adam Tygart wrote:
>>> Does this
The problem is that "hammer" tunables (i.e. "optimal" in v0.94.x) are
incompatible with the kernel interfaces before Linux 4.1 (namely due
to straw2 buckets). To make use of the kernel interfaces in 3.13, I
believe you'll need "firefly" tunables.
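A sketch of the relevant commands (note that switching tunables profiles triggers data movement across the cluster):

```shell
# Inspect the currently active tunables:
ceph osd crush show-tunables
# Fall back to firefly tunables for pre-4.1 kernel clients
# (expect rebalancing traffic afterwards):
ceph osd crush tunables firefly
```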
--
Adam
On Sun, Nov 8, 2015 at 11:48 PM, Bogdan SO
Supposedly cephfs-hadoop worked and/or works on hadoop 2. I am in the
process of getting it working with cdh5.7.0 (based on hadoop 2.6.0).
I'm under the impression that it is/was working with 2.4.0 at some
point in time.
At this very moment, I can use all of the DFS tools built into hadoop
to crea
A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt
At the moment, we're dead in the water. I would appreciate
> 6: (()+0x7dc5) [0x7f35146ecdc5]
> 7: (clone()+0x6d) [0x7f3512d77ced]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
> 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(const ceph::hear
ndon referred to. Which in CephFS can be fixed by either
> having smaller folders or (if you're very nervy, and ready to turn on
> something we think works but don't test enough) enabling directory
> fragmentation.
> -Greg
>
> On Wed, Jun 1, 2016 at 2:14 PM, Adam Tygart
kR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
> Hopefully it works in your case too and you can the cluster back to a state
> that you can make the CephFS directories smaller.
>
> - Brandon
>
> On Wed, Jun 1, 2016 at 4:22 PM, Gregory Farnum wrote:
>>
>> O
I'm still exporting pgs out of some of the downed osds, but things are
definitely looking promising.
Marginally related to this thread, as these seem to be most of the
hanging objects when exporting pgs, what are inodes in the 600 range
used for within the metadata pool? I know the 200 range is us
m). Could that cause this fragmentation that we think is
the issue?
--
Adam
On Thu, Jun 2, 2016 at 10:32 PM, Adam Tygart wrote:
> I'm still exporting pgs out of some of the downed osds, but things are
> definitely looking promising.
>
> Marginally related to this thread, as thes
With regards to this export/import process, I've been exporting a pg
from an osd for more than 24 hours now. The entire OSD only has 8.6GB
of data. 3GB of that is in omap. The export for this particular PG is
only 108MB in size right now, after more than 24 hours. How is it
possible that a fragment
If your monitor nodes are separate from the osd nodes, I'd get ceph
upgraded to the latest point release of your current line (0.94.7).
Upgrade monitors, then osds, then other dependent services (mds, rgw,
qemu).
Once everything is happy again, I'd run OS and ceph upgrades together,
starting with m
Would it be beneficial for anyone to have an archive copy of an osd
that took more than 4 days to export? All but an hour of that time was
spent exporting 1 pg (that ended up being 197MB). I can even send
along the extracted pg for analysis...
--
Adam
On Fri, Jun 3, 2016 at 2:39 PM, Adam Tygart
IPoIB is done with broadcast packets on the Infiniband fabric. Most
switches and opensm (by default) set up a broadcast group at the lowest
IB speed (SDR), to support all possible IB connections. If you're
using pure DDR, you may need to tune the broadcast group in your
subnet manager to set the spe
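With opensm, the default partition's rate is set in its partition config. The values below are my assumption for a pure 4X DDR fabric (opensm encodes mtu=4 as 2048 bytes and rate=6 as 20 Gbps):

```
# /etc/opensm/partitions.conf
# rate=6 -> 20 Gbps (4X DDR); mtu=4 -> 2048 bytes
Default=0x7fff, ipoib, mtu=4, rate=6, defmember=full : ALL=full;
```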
I believe this is what you want:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_the_Subnet_Manager.html
--
Adam
On Thu, Jun 9, 2016 at 10:01 AM, Gandalf Corvotempesta
wrote:
> Il 09 giu 2016 15:41, "Adam Tygart"
This sounds an awful lot like a bug I've run into a few times (not
often enough to get a good backtrace out of the kernel or mds)
involving vim on a symlink to a file in another directory. It will
occasionally corrupt the symlink in such a way that the symlink is
unreadable. Filling dmesg with:
According to Sage[1] Bluestore makes use of the pagecache. I don't
believe read-ahead is a filesystem tunable in Linux; it is set on the
block device itself, therefore read-ahead shouldn't be an issue.
I'm not familiar enough with Bluestore to comment on the rest.
[1] http://www.spinics.net/lists
Responses inline.
On Sat, Jun 18, 2016 at 4:53 PM, ServerPoint wrote:
> Hi,
>
> I am trying to setup a Ceph cluster and mount it as CephFS
>
> These are the steps that I followed :
> -
> ceph-deploy new mon
> ceph-deploy install admin mon node2 nod
Some people are doing hyperconverged ceph, colocating qemu
virtualization with ceph-osds. It is relevant for a decent subset of
people here. Therefore knowledge of the degree of performance
degradation is useful.
--
Adam
On Thu, Jan 11, 2018 at 11:38 AM, wrote:
> I don't understand how all of t
Hello all,
I've gotten myself into a bit of a bind. I was prepping to add a new
mds node to my ceph cluster. e.g. ceph-deploy mds create mormo
Unfortunately, it started the mds server before I was ready. My
cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
This caused 3 of my 5 m
I could, I suppose, update the monmaps in the working monitors to
remove the broken ones and then re-deploy the broken ones. The main
concern I have is that if the mdsmap update isn't pending on the
working ones, what else isn't in sync.
Thoughts?
--
Adam
On Fri, Sep 30, 2016 at 11:0
Hello all,
Not sure if this went through before or not, as I can't check the
mailing list archives.
I've gotten myself into a bit of a bind. I was prepping to add a new
mds node to my ceph cluster. e.g. ceph-deploy mds create mormo
Unfortunately, it started the mds server before I was ready. My
e and standby servers while initializing the mons?
I would hope that, now that all the versions are in sync, a bad
standby_for_fscid would not be possible with new mds servers starting.
--
Adam
On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum wrote:
> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygar
Sent before I was ready, oops.
How might I get the osdmap from a down cluster?
--
Adam
On Mon, Oct 3, 2016 at 12:29 AM, Adam Tygart wrote:
> I put this in the #ceph-dev on Friday,
>
> (gdb) print info
> $7 = (const MDSMap::mds_info_t &) @0x5fb1da68: {
>
I've hit this today with an upgrade to 12.2.6 on my backup cluster.
Unfortunately there were issues with the logs (in that the files
weren't writable) until after the issue struck.
2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log
[ERR] : 5.255 full-object read crc 0x4e97b4e != e
Bluestore.
On Fri, Jul 13, 2018, 05:56 Dan van der Ster wrote:
> Hi Adam,
>
> Are your osds bluestore or filestore?
>
> -- dan
>
>
> On Fri, Jul 13, 2018 at 7:38 AM Adam Tygart wrote:
> >
> > I've hit this today with an upgrade to 12.2.6 on my bac
Check out the message titled "IMPORTANT: broken luminous 12.2.6
release in repo, do not upgrade"
It sounds like 12.2.7 should come *soon* to fix this transparently.
--
Adam
On Sun, Jul 15, 2018 at 10:28 AM, Nicolas Huillard
wrote:
> Hi all,
>
> I have the same problem here:
> * during the upgra
I don't care what happens to most of these rados commands, and I've
never used the auid "functionality", but I have found the rados purge
command quite useful when testing different rados level applications.
Run a rados-level application test. Whoops, it didn't do what you
wanted? Purge and start o
This issue was related to using Jemalloc. Jemalloc is not as well
tested with Bluestore and led to lots of segfaults. We moved back to
the default of tcmalloc with Bluestore and these stopped.
Check /etc/sysconfig/ceph under RHEL based distros.
--
Adam
On Mon, Aug 27, 2018 at 9:51 PM Tyler Bisho
Hello all,
I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
cephfs mounts (kernel client). We're currently only using 1 active
mds.
Performance is great about 80% of the time. MDS responses (per ceph
daemonperf mds.$(hostname -s)) indicate 2k-9k requests per second,
with a la
DS needing to catch up on log trimming (though I’m unclear why changing the
> cache size would impact this).
>
> On Sun, Sep 30, 2018 at 9:02 PM Adam Tygart wrote:
>>
>> Hello all,
>>
>> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
g with
ops, ops_in_flight, perf dump and objecter requests. Thanks for your
time.
--
Adam
On Mon, Oct 1, 2018 at 10:36 PM Adam Tygart wrote:
>
> Okay, here's what I've got: https://www.paste.ie/view/abe8c712
>
> Of note, I've changed things up a little bit for the moment. I've
Hello all,
I'm having some stability issues with my ceph cluster at the moment.
Using CentOS 7, and Ceph 12.2.4.
I have osds that are segfaulting regularly. roughly every minute or
so, and it seems to be getting worse, now with cascading failures.
Backtraces look like this:
ceph version 12.2.4
Well, the cascading crashes are getting worse. I'm routinely seeing
8-10 of my 518 osds crash. I cannot start 2 of them without triggering
14 or so of them to crash repeatedly for more than an hour.
I've run another one of them with more logging: debug osd = 20; debug
ms = 1 (definitely more than
:11 PM, Josh Durgin wrote:
>>
>> On 04/05/2018 06:15 PM, Adam Tygart wrote:
>>>
>>> Well, the cascading crashes are getting worse. I'm routinely seeing
>>> 8-10 of my 518 osds crash. I cannot start 2 of them without triggering
>>> 14 or so of them
ecially the
> complete_read_op one from http://tracker.ceph.com/issues/21931
>
> Josh
>
>
> On 04/05/2018 08:25 PM, Adam Tygart wrote:
>>
>> Thank you! Setting norecover has seemed to work in terms of keeping
>> the osds up. I am glad my logs were of use to trackin
Ceph has the ability to use a script to figure out where in the
crushmap this disk should go (on osd start):
http://docs.ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook
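A minimal hook might look like this (the bucket names are placeholder assumptions; ceph-osd runs the configured script at startup and places the OSD at whatever crushmap location it prints):

```shell
# Hypothetical crush location hook: print "key=value" location pairs
# for the OSD being started on this host.
crush_location() {
  echo "host=$(hostname -s) rack=rack1 root=default"
}
crush_location
```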
--
Adam
On Tue, Apr 18, 2017 at 7:53 AM, Matthew Vernon wrote:
> On 17/04/17 21:16, Richard Hesse wrot
I'm using CephFS, on CentOS 7. We're currently migrating away from
using a catch-all cephx key to mount the filesystem (with the kernel
module), to a much more restricted key.
In my tests, I've come across an issue, extracting a tar archive with
a mount using the restricted key routinely cannot cr
As I understand it:
4.2G is used by ceph (all replication, metadata, et al.); it is the sum
of all the space "used" on the osds.
958M is the actual space the data in cephfs is using (without replication).
3.8G means you have some sparse files in cephfs.
'ceph df detail' should return something close
It appears that with --apparent-size, du adds the "size" of the
directories to the total as well. On most filesystems this is the
block size, or the amount of metadata space the directory is using. On
CephFS, this size is fabricated to be the size sum of all sub-files.
i.e. a cheap/free 'du -sh $fo
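The sparse-file side of this is easy to see locally (paths are arbitrary):

```shell
cd "$(mktemp -d)"
truncate -s 1G sparse.img    # 1 GiB apparent size, no blocks allocated
echo "apparent: $(du -k --apparent-size sparse.img | cut -f1) KiB"
echo "actual:   $(du -k sparse.img | cut -f1) KiB"
```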
The docs are already split by version, although it doesn't help that
it isn't linked in an obvious manner.
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
http://docs.ceph.com/docs/hammer/rados/operations/cache-tiering/
Updating the documentation takes a lot of effort by all in
l you need people to moderate them, and
then that takes time away from people that could either be developing
the software or updating documentation.
On Thu, Feb 25, 2016 at 11:24 PM, Nigel Williams
wrote:
> On Fri, Feb 26, 2016 at 4:09 PM, Adam Tygart wrote:
>> The docs are already
AFAIR, there is a feature request in the works to allow rebuild with K
chunks, but not allow normal read/write until min_size is met. Not
that I think running with m=1 is a good idea. I'm not seeing the
tracker issue for it at the moment, though.
--
Adam
On Tue, Dec 11, 2018 at 9:50 PM Ashley Merr
Hello all,
I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 7.6.
We're using cephfs and rbd.
Last night, one of our two active/active mds servers went laggy, and
upon restart, once it goes active, it immediately goes laggy again.
I've got a log available here (debug_mds 20, debug
all running CentOS 7.6 and using
the kernel cephfs mount). I hope there is enough logging from before
to try to track this issue down.
We are back up and running for the moment.
--
Adam
On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart wrote:
>
> Hello all,
>
> I've got a 31 mac
t the directory before I re-enable their jobs?
--
Adam
On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart wrote:
>
> On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
> minutes after that restarted the mds daemon. It replayed the journal,
> evicted the dead compute nodes and
It worked for about a week, and then seems to have locked up again.
Here is the back trace from the threads on the mds:
http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
--
Adam
On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng wrote:
>
> On Sun, Jan 13, 2019 at 1:43 PM Adam
On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart wrote:
>
> It worked for about a week, and then see
Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the
27th of December.
--
Adam
On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart wrote:
>
> Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't
> happen before then.
>
> --
> Adam
>
>
at 8:42 PM Adam Tygart wrote:
>
> Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the
> 27th of December.
>
> --
> Adam
>
> On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart wrote:
> >
> > Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of
As of 13.2.3, you should use 'osd_memory_target' instead of
'bluestore_cache_size'
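i.e. a ceph.conf fragment along these lines (the 4GiB value is just an example):

```
[osd]
# 13.2.3+: total memory target per OSD process; the bluestore
# caches auto-shrink to keep the daemon near this size.
osd_memory_target = 4294967296
```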
--
Adam
On Tue, Apr 16, 2019 at 10:28 AM Jonathan Proulx wrote:
>
> Hi All,
>
> I have a a few servers that are a bit undersized on RAM for number of
> osds they run.
>
> When we switched to bluestore about 1yr ag
Hello all,
I've got a 30 node cluster serving up lots of CephFS data.
We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
this week.
We've been running 2 MDS daemons in an active-active setup. Tonight
one of the metadata daemons crashed with the following several times:
-
Hello all,
The rank 0 mds is still asserting. Is this duplicate inode situation
one where I should consider using cephfs-journal-tool to export,
recover dentries, and reset?
Thanks,
Adam
On Thu, May 16, 2019 at 12:51 AM Adam Tygart wrote:
>
> Hello all,
>
> I've got
hu, May 16, 2019, 13:52 Adam Tygart wrote:
Hello all,
The rank 0 mds is still asserting. Is this duplicate inode situation
one that I should be considering using the cephfs-journal-tool to
export, recover dentries and reset?
Thanks,
Adam
On Thu, May 16, 2019 at 12:5
, May 17, 2019 at 2:30 AM wrote:
>
> Hi
> Can you tell me the detailed recovery commands?
>
> I just started learning cephfs; I would be grateful.
>
>
>
> From: Adam Tygart
> To: Ceph Users
> Date: 2019/05/17 09:04
> Subject: [lists.ceph
y created
fs entries would not pass MDS scrub due to linkage errors.
May 17, 2019 3:40 PM, "Adam Tygart" wrote:
> I followed the docs from here:
> http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
Hello all,
We're using Nautilus 14.2.2 (upgrading soon to 14.2.3) on 29 CentOS osd servers.
We've got a large variation of disk sizes and host densities. Such
that the default crush mappings lead to an unbalanced data and pg
distribution.
We enabled the balancer manager module in pg upmap mode.
Thanks,
I moved back to crush-compat mapping; the pool that was at "90% full"
is now under 76% full.
Before doing that, I had the automatic balancer off, and ran 'ceph
balancer optimize test'. It ran for 12 hours before I killed it. In
upmap mode, it was "balanced" or at least as balanced as it c
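For reference, the sequence I used looks roughly like this (the plan name is arbitrary):

```shell
ceph balancer off
ceph balancer mode crush-compat
ceph balancer optimize myplan
ceph balancer show myplan
ceph balancer execute myplan
ceph balancer on
```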