Here are the ceph log messages (including the libceph kernel debug
stuff you asked for) from a node boot with the rbd command hung for a
couple of minutes:
https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos
Hum sorry, you're right. Forget about what I said :)
On Thu, Nov 22, 2012 at 4:54 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
I thought the client would then write to the 2nd - is this wrong?
Stefan
Am 22.11.2012 um 16:49 schrieb Sébastien Han han.sebast...@gmail.com:
But
We need something like tmpfs - running in local memory but supporting dio.
Maybe with a ramdisk, /dev/ram0?
We can format it with a standard filesystem (ext3, ext4, ...), so maybe dio works
with it?
- Mail original -
De: Stefan Priebe - Profihost AG s.pri...@profihost.ag
À: Sébastien
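For reference, a rough sketch of what the /dev/ram0 idea would look like on one OSD
node (the brd module parameters, mount point and journal path below are assumptions,
not a tested recommendation - and the journal content is lost on every reboot):
# load the ramdisk driver with one 1 GiB device (rd_size is in KiB)
$ modprobe brd rd_nr=1 rd_size=1048576
# put a regular filesystem on it so O_DIRECT (journal dio) can work
$ mkfs.ext4 /dev/ram0
$ mount -o noatime /dev/ram0 /mnt/ramjournal
# point the OSD journal at it (paths are hypothetical)
$ ln -sf /mnt/ramjournal/osd.0.journal /var/lib/ceph/osd/ceph-0/journal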
On Thu, 22 Nov 2012, hemant surale wrote:
Sir,
Thanks for the direction. Here I was using the mount.ceph monaddr:ip:/
/home/hemant/mntpoint command. Is it possible to achieve the same effect
with mount.ceph as with what you suggested with cephfs (cephfs
/mnt/ceph/foo --pool poolid)?
But I see that
2012/11/22 Gregory Farnum g...@inktank.com:
On Tue, Nov 20, 2012 at 8:28 PM, Drunkard Zhang gongfan...@gmail.com wrote:
2012/11/21 Gregory Farnum g...@inktank.com:
No, absolutely not. There is no relationship between different RADOS
pools. If you've been using the cephfs tool to place some
On 11/22/2012 06:57 PM, Stefan Priebe - Profihost AG wrote:
Hi,
Am 21.11.2012 14:47, schrieb Wido den Hollander:
The snapshot isn't consistent since it has no way of telling the VM to
flush its buffers.
To make it consistent you have to run sync (in the VM) just prior to
creating the
I thought the client would then write to the 2nd - is this wrong?
Stefan
Am 22.11.2012 um 16:49 schrieb Sébastien Han han.sebast...@gmail.com:
But who cares? It's also on the 2nd node, or even on the 3rd if you have
3 replicas.
Yes, but you could also suffer a crash while writing the first
Am 22.11.2012 13:50, schrieb Sébastien Han:
journal is running on tmpfs for me but that changes nothing.
I don't think it works then. According to the doc: "Enables using
libaio for asynchronous writes to the journal. Requires journal dio
set to true."
Ah might be but as the SSDs are pretty
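For context, these are the two ceph.conf options the doc quote refers to; a minimal
fragment would look like this (the [osd] section is the usual place for them, and
aio only helps when the journal sits on a device or filesystem that supports
O_DIRECT, which tmpfs does not):
[osd]
    journal dio = true
    journal aio = true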
Am 22.11.2012 15:37, schrieb Mark Nelson:
I don't think we recommend tmpfs at all for anything other than playing
around. :)
I discussed this with somebody from Inktank. Had to search the
mailing list. It might be OK if you're working with enough replicas and a UPS.
I see no other option while
On 11/22/2012 04:49 AM, Sébastien Han wrote:
@Alexandre: cool!
@ Stefan: Full SSD cluster and 10G switches? Couple of weeks ago I saw
that you use journal aio, did you notice performance improvement with it?
@Mark Kampe
If I read the above correctly, your random operations are 4K and your
On Thu, Nov 22, 2012 at 2:05 AM, Josh Durgin josh.dur...@inktank.com wrote:
On 11/21/2012 04:50 AM, Andrey Korolyov wrote:
Hi,
Somehow I have managed to produce an unkillable snapshot, which cannot
be removed, nor can its parent image:
$ rbd snap purge dev-rack0/vm2
Removing all
Otherwise you would have the same problem with disk crashes.
Am 22.11.2012 um 16:55 schrieb Sébastien Han han.sebast...@gmail.com:
Hum sorry, you're right. Forget about what I said :)
On Thu, Nov 22, 2012 at 4:54 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
I thought
Dear all,
I am trying to do a small experiment with crushtool by simulating
different CRUSH variants.
However, I am having some problems with crushtool due to its lack of
documentation.
I want to ask: what is the command to simulate placement in a 32-device
bucket system (only 1 bucket)? And
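Something along these lines should do that kind of simulation (the bucket type,
replica count and file names below are only illustrative placeholders):
# build a trivial map: 32 OSDs in a single straw bucket
$ crushtool --outfn test.map --build --num_osds 32 node straw 32
# simulate placement of many inputs with 2 replicas and show the distribution
$ crushtool -i test.map --test --num-rep 2 --min-x 0 --max-x 10000 --show-utilization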
I don't think we recommend tmpfs at all for anything other than playing
around. :)
On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:
Hi,
can someone from Inktank comment on this? Might using /dev/ram0 with an
fs on it be better than tmpfs, since we can use dio?
Greets,
Stefan
-
It's very easy to reproduce now with my automated install script, the
most I've seen it succeed with that patch is 2 in a row, and hanging
on the 3rd, although it hangs on most builds. So it shouldn't take
much to get it to do it again. I'll try and get to that tomorrow,
when I'm a bit more
Am 22.11.2012 14:22, schrieb Sébastien Han:
And RAMDISK devices are too expensive.
It would make sense in your infra, but yes they are really expensive.
We need something like tmpfs - running in local memory but supporting dio.
Stefan
This one fixes a race which qemu also had in the iscsi block driver,
between cancellation and I/O completion.
qemu_rbd_aio_cancel was not synchronously waiting for the end of
the command.
To achieve this it introduces a new status flag which uses
-EINPROGRESS.
Signed-off-by: Stefan Priebe
Am 22.11.2012 15:52, schrieb Alexandre DERUMIER:
I discussed this with somebody from Inktank. Had to search the
mailing list. It might be OK if you're working with enough replicas and a UPS.
I see no other option while working with SSDs - the only option would be
to be able to deactivate the
Am 22.11.2012 15:46, schrieb Mark Nelson:
I haven't played a whole lot with SSD only OSDs yet (other than noting
last summer that iop performance wasn't as high as I wanted it). Is a
second partition on the SSD for the journal not an option for you?
Haven't tested that. But does this make
Am 22.11.2012 11:49, schrieb Sébastien Han:
@Alexandre: cool!
@ Stefan: Full SSD cluster and 10G switches?
Yes
Couple of weeks ago I saw
that you use journal aio, did you notice performance improvement with it?
journal is running on tmpfs for me but that changes nothing.
Stefan
I discussed this with somebody from Inktank. Had to search the
mailing list. It might be OK if you're working with enough replicas and a UPS.
I see no other option while working with SSDs - the only option would be
to be able to deactivate the journal entirely. But ceph does not support this.
Do you
Hi Andreas,
thanks for your comment. Do I have to resend this patch?
--
Greets,
Stefan
Am 22.11.2012 17:40, schrieb Andreas Färber:
Am 22.11.2012 10:07, schrieb Stefan Priebe:
When acb->cmd is WRITE or DISCARD, block/rbd stores rcb->size into acb->ret.
Look here:
if (acb->cmd == RBD_AIO_WRITE
Hi All,
Is it possible at this point in time to set up some form of tiering of storage
pools in ceph by modifying the crush map? For example, I want to have my most
recently used data on a small set of nodes that have SSDs and over time
migrate data from the SSDs to some bulk spinning disk
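There is no automatic migration, but pinning whole pools to different hardware via
the CRUSH map is the usual approach; roughly like this (the "ssd" root, ruleset id
and pool name are made up for illustration):
# dump and decompile the current CRUSH map
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add an "ssd" root containing only the SSD hosts, plus a rule
# (say ruleset 3) whose first step is "step take ssd"
$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new
# point one pool at the SSD rule; data in other pools stays on the spinning disks
$ ceph osd pool set ssd-pool crush_ruleset 3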
Sir,
Thanks for the direction. Here I was using the mount.ceph monaddr:ip:/
/home/hemant/mntpoint command. Is it possible to achieve the same effect
with mount.ceph as with what you suggested with cephfs (cephfs
/mnt/ceph/foo --pool poolid)?
But I see that cephfs is able to set which OSDs to use; the
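For reference, the two steps would look something like this (the monitor address,
mount point and pool id are placeholders; exact set_layout arguments vary between
versions, and the layout only applies to files created afterwards):
# mount the filesystem with the kernel client
$ mount.ceph 192.168.0.1:6789:/ /home/hemant/mntpoint -o name=admin,secretfile=/etc/ceph/secret
# pin a directory (and new files created under it) to a specific pool
$ mkdir /home/hemant/mntpoint/foo
$ cephfs /home/hemant/mntpoint/foo set_layout --pool 3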
Hi,
Recent versions of Ceph introduce some unexpected behavior for
permanent connections (VM or kernel clients) - after crash
recovery, I/O will hang on the next planned scrub in the following
scenario:
- launch a bunch of clients doing non-intensive writes,
- lose one or more osd, mark
Same for me:
rand 4k: 23,000 iops
seq 4k: 13,000 iops
Even in writeback mode where normally seq 4k should be merged into
bigger requests.
Stefan
Am 21.11.2012 17:34, schrieb Mark Nelson:
Responding to my own message. :)
Talked to Sage a bit offline about this. I think there are two
From: Yan, Zheng zheng.z@intel.com
When a null dentry is encountered, CDir::_commit_partial() adds
an OSD_TMAP_RM command to delete the dentry. But if the dentry is
new, the osd will not find the dentry when handling the command
and the tmap update operation will fail totally.
This patch also
but it seems that Alexandre and I have the same results (more rand
than seq); he has (at least) one cluster and I have 2. Thus I'm starting to
think that's not an isolated issue.
Hi, I have bought new servers with more powerful CPUs to make a new 3-node
cluster to compare.
I'll redo tests in 1
Hello
Thanks for your attention, and sorry for my bad English!
In my draft architecture, I want to use the same hardware for OSDs and RBD
devices. In other words, I have 5 nodes with 5 TB of software RAID disk
space on each. I want to build a Ceph cluster on these nodes. All 5 nodes will
run OSDs and, on the
Am 22.11.2012 10:07, schrieb Stefan Priebe:
When acb->cmd is WRITE or DISCARD, block/rbd stores rcb->size into acb->ret.
Look here:
if (acb->cmd == RBD_AIO_WRITE ||
    acb->cmd == RBD_AIO_DISCARD) {
    if (r < 0) {
        acb->ret = r;
        acb->error = 1;
    } else
Am 22.11.2012 16:26, schrieb Alexandre DERUMIER:
Haven't tested that. But does this make sense? I mean, data goes to the disk
journal, then the same disk has to copy the data from partition A to partition B.
Why is this an advantage?
Well, if you are CPU limited, I don't think you can use all 8*35,000 iops per
Am 21.11.2012 23:32, schrieb Peter Maydell:
On 21 November 2012 17:03, Stefan Weil s...@weilnetz.de wrote:
Why do you use int64_t instead of off_t?
If the value is related to file sizes, off_t would be a good choice.
Looking at the librbd API (which is what the size and ret
values come from),
On Thu, 22 Nov 2012, Yan, Zheng wrote:
From: Yan, Zheng zheng.z@intel.com
When a null dentry is encountered, CDir::_commit_partial() adds
an OSD_TMAP_RM command to delete the dentry. But if the dentry is
new, the osd will not find the dentry when handling the command
and the tmap update
On Wed, 21 Nov 2012, Nick Bartos wrote:
FYI the build which included all 3.5 backports except patch #50 is
still going strong after 21 builds.
Okay, that one at least makes some sense. I've opened
http://tracker.newdream.net/issues/3519
How easy is this to reproduce? If it is
Am 22.11.2012 20:09, schrieb Stefan Priebe - Profihost AG:
Hi Andreas,
thanks for your comment. Do I have to resend this patch?
--
Greets,
Stefan
Hi Stefan,
I'm afraid yes, you'll have to resend the patch.
Signed-off-by is a must, see http://wiki.qemu.org/Contribute/SubmitAPatch
When
On 22 November 2012 08:23, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
Am 21.11.2012 23:32, schrieb Peter Maydell:
Looking at the librbd API (which is what the size and ret
values come from), it uses size_t and ssize_t for these.
So I think probably ssize_t is the right type for
On 21 November 2012 17:03, Stefan Weil s...@weilnetz.de wrote:
Why do you use int64_t instead of off_t?
If the value is related to file sizes, off_t would be a good choice.
Looking at the librbd API (which is what the size and ret
values come from), it uses size_t and ssize_t for these.
So I
Hi,
Am 21.11.2012 15:29, schrieb Wido den Hollander:
Use:
$ rbd -p kvmpool1 snap create --image vm-113-disk-1 BACKUP
rbd -h also says:
image-name, snap-name are [pool/]name[@snap], or you may specify
individual pieces of names with -p/--pool, --image, and/or --snap.
Never tried it, but you
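If the combined form from the help text works as described, the equivalent
one-liner would be (untested here, just following the [pool/]name[@snap] pattern):
$ rbd snap create kvmpool1/vm-113-disk-1@BACKUP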
Hi,
can someone from Inktank comment on this? Might using /dev/ram0 with an
fs on it be better than tmpfs, since we can use dio?
Greets,
Stefan
- Mail original -
De: Stefan Priebe - Profihost AG s.pri...@profihost.ag
À: Sébastien Han han.sebast...@gmail.com
Cc: Mark Nelson
Hi,
Am 21.11.2012 14:47, schrieb Wido den Hollander:
The snapshot isn't consistent since it has no way of telling the VM to
flush its buffers.
To make it consistent you have to run sync (in the VM) just prior to
creating the snapshot.
Mhm but between executing sync and executing snap is
Sequential is faster than random on a disk, but we are not
doing I/O to a disk; we are doing I/O to a distributed storage cluster:
small random operations are striped over multiple objects and
servers, and so can proceed in parallel and take advantage of
more nodes and disks. This parallelism can
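For anyone comparing such numbers, a typical 4K test inside the guest would look
roughly like this (device path, runtime and queue depth are arbitrary choices, and
it writes destructively to the target device):
# 4K random writes against the RBD-backed disk, direct I/O, queue depth 32
$ fio --name=randw --filename=/dev/vdb --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based
# the sequential equivalent, for comparison
$ fio --name=seqw --filename=/dev/vdb --rw=write --bs=4k --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based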
Hello list,
right now an rbd export exports exactly the size of the disk even if
there is KNOWN free space. Is this intended to change?
Might it be possible to export just the differences between snapshots and
merge them later?
Greets,
Stefan
FYI the build which included all 3.5 backports except patch #50 is
still going strong after 21 builds.
On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote:
With 8 successful installs already done, I'm reasonably confident that
it's patch #50. I'm making another build which
When acb->cmd is WRITE or DISCARD, block/rbd stores rcb->size into acb->ret.
Look here:
if (acb->cmd == RBD_AIO_WRITE ||
    acb->cmd == RBD_AIO_DISCARD) {
    if (r < 0) {
        acb->ret = r;
        acb->error = 1;
    } else if (!acb->error) {
        acb->ret = rcb->size;
Hello,
I sent a new patch using ssize_t. (Subject: [PATCH] overflow of int ret:
use ssize_t for ret)
Stefan
Am 22.11.2012 09:40, schrieb Peter Maydell:
On 22 November 2012 08:23, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
Am 21.11.2012 23:32, schrieb Peter Maydell:
Looking
Am 21.11.2012 21:53, schrieb Stefan Priebe -
Profihost AG:
Not sure about off_t. What is min and max size? Stefan
off_t is a signed value which is used by functions like lseek to
address any byte of a seekable file.
The range is typically 64 bits
On 11/21/2012 04:50 AM, Andrey Korolyov wrote:
Hi,
Somehow I have managed to produce an unkillable snapshot, which cannot
be removed, nor can its parent image:
$ rbd snap purge dev-rack0/vm2
Removing all snapshots: 100% complete...done.
I see one bug with 'snap purge' ignoring the return
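If the snapshot turns out to be protected with clones hanging off it, the manual
cleanup usually looks like this (the snapshot and child names are placeholders,
this assumes an rbd build with layering support, and it may not be what 'snap
purge' was actually hiding here):
# list snapshots and any clones depending on them
$ rbd snap ls dev-rack0/vm2
$ rbd children dev-rack0/vm2@snapname
# flatten (or remove) the clones, then unprotect and remove the snapshot
$ rbd flatten dev-rack0/child-image
$ rbd snap unprotect dev-rack0/vm2@snapname
$ rbd snap rm dev-rack0/vm2@snapname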
__printf is useful to verify format and arguments.
Signed-off-by: Joe Perches j...@perches.com
---
 fs/ceph/super.c             |    2 +-
 include/linux/backing-dev.h |    1 +
 2 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 2eb43f2..e7dbb5c
On Thu, 22 Nov 2012, Andrey Korolyov wrote:
Hi,
Recent versions of Ceph introduce some unexpected behavior for
permanent connections (VM or kernel clients) - after crash
recovery, I/O will hang on the next planned scrub in the following
scenario:
- launch a bunch of clients
From: Yan, Zheng zheng.z@intel.com
When a null dentry is encountered, CDir::_commit_partial() adds
an OSD_TMAP_RM command to delete the dentry. But if the dentry is
new, the osd will not find the dentry when handling the command
and the tmap update operation will fail totally.
Signed-off-by:
Signed-off-by: Stefan Priebe s.pri...@profihost.ag
Am 22.11.2012 10:07, schrieb Stefan Priebe:
When acb->cmd is WRITE or DISCARD, block/rbd stores rcb->size into acb->ret.
Look here:
if (acb->cmd == RBD_AIO_WRITE ||
    acb->cmd == RBD_AIO_DISCARD) {
    if (r < 0) {
In my test it was just recovering some replicas, not the whole OSD.
Am 22.11.2012 um 16:35 schrieb Alexandre DERUMIER aderum...@odiso.com:
But who cares? It's also on the 2nd node, or even on the 3rd if you have
3 replicas.
Yes, but rebuilding a dead node uses CPU and IOs. (but it should be
Hi Mark,
Well, the most concerning thing is that I have 2 Ceph clusters and both
of them show better rand than seq...
I don't have enough background to argue with your assumptions, but I
could try to shrink my test platform to a single OSD and see how it
performs. We'll keep in touch on that one.
But it
Hi,
I was looking at the source code of the ceph MDS and in particular at
the function
CInode* Server::prepare_new_inode(...) in the mds/Server.cc file which
creates a new inode.
At lines 1739-1747 the code checks if the parent directory has the
set-group-ID bit set. If
this bit is set and the
Hi,
I know that ceph has time-synced servers as a requirement, but I
think a sane failure mode, like a message in the logs instead of
uncontrollably growing memory usage, would be a good idea.
I had the NTP process die on me tonight on an OSD (for unknown reasons
so far ...) and the clock went
I haven't played a whole lot with SSD only OSDs yet (other than noting
last summer that iop performance wasn't as high as I wanted it). Is a
second partition on the SSD for the journal not an option for you?
Mark
On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote:
Am 22.11.2012
On 11/22/2012 05:13 AM, Wido den Hollander wrote:
On 11/22/2012 06:57 PM, Stefan Priebe - Profihost AG wrote:
Hi,
Am 21.11.2012 14:47, schrieb Wido den Hollander:
The snapshot isn't consistent since it has no way of telling the VM to
flush its buffers.
To make it consistent you have to
Hi Josh,
Am 22.11.2012 22:08, schrieb Josh Durgin:
This way you have a pretty consistent snapshot.
You can get an entirely consistent snapshot using xfs_freeze to
stop I/O to the fs until you thaw it. It's done at the vfs level
these days, so it works on all filesystems.
Great thing we
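Put together, the consistent-snapshot sequence would look roughly like this (the
guest mount point, pool and image names are illustrative):
# inside the VM: flush and freeze the filesystem on the RBD-backed disk
$ xfs_freeze -f /mnt/data
# on a ceph client: take the snapshot while the fs is quiesced
$ rbd snap create kvmpool1/vm-113-disk-1@consistent-backup
# inside the VM: resume I/O
$ xfs_freeze -u /mnt/data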
But who cares? It's also on the 2nd node, or even on the 3rd if you have
3 replicas.
Yes, but rebuilding a dead node uses CPU and IOs. (but it should be benched too,
to see the impact on production)
- Mail original -
De: Stefan Priebe - Profihost AG s.pri...@profihost.ag
À:
On Thu, 22 Nov 2012, Giorgos Kappes wrote:
Hi,
I was looking at the source code of the ceph MDS and in particular at
the function
CInode* Server::prepare_new_inode(...) in the mds/Server.cc file which
creates a new inode.
At lines 1739-1747 the code checks if the parent directory has the
On Thu, 22 Nov 2012, Stefan Priebe - Profihost AG wrote:
Hello list,
right now an rbd export exports exactly the size of the disk even if there is
KNOWN free space. Is this intended to change?
Might it be possible to export just the differences between snapshots and merge
them later?
We were
Haven't tested that. But does this make sense? I mean, data goes to the disk
journal, then the same disk has to copy the data from partition A to partition B.
Why is this an advantage?
Well, if you are CPU limited, I don't think you can use all 8*35,000 iops per node.
So, maybe a benchmark can tell us if the
But who cares? It's also on the 2nd node, or even on the 3rd if you have
3 replicas.
Yes, but you could also suffer a crash while writing the first replica.
If the journal is in tmpfs, there is nothing to replay.
On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
journal is running on tmpfs for me but that changes nothing.
I don't think it works then. According to the doc: "Enables using
libaio for asynchronous writes to the journal. Requires journal dio
set to true."
On Thu, Nov 22, 2012 at 12:48 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag
Hi folks,
I figured it might be a cool thing to have packages of ceph-deploy for
Debian and Ubuntu 12.04; I took the time and created them (along with
packages of python-pushy, which ceph-deploy needs but which was not
present in the Debian archive and thus in the Ubuntu archive either).
They
Hi list,
I am thinking about the possibility of adding some primitives to CRUSH to meet
the following user stories:
A. Same host, same rack
To balance availability and performance, one may want a rule like this:
3 replicas, where replica 1 and replica 2 should be in the same rack while
Step 2 is to export the incremental changes. The hangup there is figuring out
a generic and portable file format to represent those incremental changes;
we'd rather not invent something ourselves that is ceph-specific.
Suggestions welcome!
AFAIK, both 'zfs' and 'btrfs' already have such
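For comparison, the btrfs version of what is being asked for looks like this
(snapshot and path names are placeholders; zfs send/receive has an analogous
incremental mode):
# read-only snapshots are required as send sources
$ btrfs subvolume snapshot -r /data /data/.snap-old
$ btrfs subvolume snapshot -r /data /data/.snap-new
# stream only the differences between the two snapshots
$ btrfs send -p /data/.snap-old /data/.snap-new > incremental.stream
# replay the stream on another btrfs filesystem
$ btrfs receive /backup < incremental.stream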
I upgraded to 0.54 and now there are some hints in the logs. The
directories referenced in the log entries are now missing:
2012-11-23 07:28:04.802864 mds.0 [ERR] loaded dup inode 100662f
[2,head] v3851654 at /xxx/20120203, but inode 100662f.head
v3853093 already exists at