[ceph-users] Crush Map for test lab

2017-10-11 Thread Ashley Merrick
Hello,

Setting up a new test lab: a single server with 5 disks/OSDs.

I want to run an EC pool that has more shards than available OSDs. Is it
possible to force CRUSH to re-use an OSD for another shard?

I know this is normally bad practice, but it is for testing only on a
single-server setup.
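
(For context, what I have so far is just the standard single-host approach
with the failure domain dropped to osd; a rough sketch, assuming a profile
called "testprofile" and a pool called "testpool", and the exact parameter
names vary a bit by release. As I understand it CRUSH still won't map two
shards of one PG onto the same OSD, hence the question:)

ceph osd erasure-code-profile set testprofile k=3 m=2 crush-failure-domain=osd
ceph osd pool create testpool 64 64 erasure testprofile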

Thanks,
Ashley

Get Outlook for Android

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Adrian Saul

It’s a fair point – in our case we are based on CentOS so self-support only 
anyway (business does not like paying support costs).  At the time we evaluated 
LIO, SCST and STGT, with a  directive to use ALUA support instead of IP 
failover.   In the end we went with SCST as it had more mature ALUA support at 
the time, and was easier to integrate into pacemaker to support the ALUA 
failover, it also seemed to perform fairly well.

However, given the road we have gone down and the issues we are facing as
we scale up and load up the storage, having a vendor support channel would
be a relief.


From: Samuel Soulard [mailto:samuel.soul...@gmail.com]
Sent: Thursday, 12 October 2017 11:20 AM
To: Adrian Saul 
Cc: Zhu Lingshan ; dilla...@redhat.com; ceph-users 

Subject: RE: [ceph-users] Ceph-ISCSI

Yes, I looked at this solution, and it seems interesting.  However, one
point that often sticks with business requirements is commercial support.

With Red Hat or SUSE, you have support provided with the solution.  I'm
not sure what support channel they offer for SCST.

Sam

On Oct 11, 2017 20:05, "Adrian Saul" 
> wrote:

As an aside, SCST  iSCSI will support ALUA and does PGRs through the use of 
DLM.  We have been using that with Solaris and Hyper-V initiators for RBD 
backed storage but still have some ongoing issues with ALUA (probably our 
current config, we need to lab later recommendations).



> -Original Message-
> From: ceph-users 
> [mailto:ceph-users-boun...@lists.ceph.com]
>  On Behalf Of
> Jason Dillaman
> Sent: Thursday, 12 October 2017 5:04 AM
> To: Samuel Soulard >
> Cc: ceph-users >; Zhu 
> Lingshan >
> Subject: Re: [ceph-users] Ceph-ISCSI
>
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
> > wrote:
> > Hmmm, If you failover the identity of the LIO configuration including
> > PGRs (I believe they are files on disk), this would work no?  Using an
> > 2 ISCSI gateways which have shared storage to store the LIO
> > configuration and PGR data.
>
> Are you referring to the Active Persist Through Power Loss (APTPL) support
> in LIO where it writes the PR metadata to "/var/target/pr/aptpl_"? I
> suppose that would work for a Pacemaker failover if you had a shared file
> system mounted between all your gateways *and* the initiator requests
> APTPL mode(?).
>
> > Also, you said another "fails over to another port", do you mean a
> > port on another ISCSI gateway?  I believe LIO with multiple target
> > portal IP on the same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways which
> doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have 1
> > ISCSI gateway available through 2 target portal IP (for data path
> > redundancy).  If this first ISCSI gateway fails, both target portal IP
> > failover to the standby node with the PGR data that is available on share
> stored.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > >
> > wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >> > wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an ISCSI gateway based on LIO and KRBD (that
> >> > is, RBD block device mounted on the ISCSI gateway and published
> >> > through LIO).
> >> > The
> >> > LIO target portal (virtual IP) would failover to another node.
> >> > This would theoretically provide support for PGRs since LIO does
> >> > support SPC-3.
> >> > Granted it is not distributed and limited to 1 single node
> >> > throughput, but this would achieve high availability required by
> >> > some environment.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> >> but the initiator cannot reach it), after it fails over to another
> >> port the PGR data won't be available.
> >>
> >> > Of course, multiple target portal would be awesome since available
> >> > throughput would be able to scale linearly, but since this isn't
> >> > here right now, this would provide at least an alternative.
> >>
> >> It would definitely be great to go active/active but there are
> >> concerns of data-corrupting edge conditions when using MPIO since it

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Samuel Soulard
Yes, I looked at this solution, and it seems interesting.  However, one
point that often sticks with business requirements is commercial support.

With Red Hat or SUSE, you have support provided with the solution.  I'm
not sure what support channel they offer for SCST.

Sam

On Oct 11, 2017 20:05, "Adrian Saul"  wrote:

>
> As an aside, SCST  iSCSI will support ALUA and does PGRs through the use
> of DLM.  We have been using that with Solaris and Hyper-V initiators for
> RBD backed storage but still have some ongoing issues with ALUA (probably
> our current config, we need to lab later recommendations).
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Jason Dillaman
> > Sent: Thursday, 12 October 2017 5:04 AM
> > To: Samuel Soulard 
> > Cc: ceph-users ; Zhu Lingshan 
> > Subject: Re: [ceph-users] Ceph-ISCSI
> >
> > On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
> >  wrote:
> > > Hmmm, If you failover the identity of the LIO configuration including
> > > PGRs (I believe they are files on disk), this would work no?  Using an
> > > 2 ISCSI gateways which have shared storage to store the LIO
> > > configuration and PGR data.
> >
> > Are you referring to the Active Persist Through Power Loss (APTPL)
> support
> > in LIO where it writes the PR metadata to "/var/target/pr/aptpl_"? I
> > suppose that would work for a Pacemaker failover if you had a shared file
> > system mounted between all your gateways *and* the initiator requests
> > APTPL mode(?).
> >
> > > Also, you said another "fails over to another port", do you mean a
> > > port on another ISCSI gateway?  I believe LIO with multiple target
> > > portal IP on the same node for path redundancy works with PGRs.
> >
> > Yes, I was referring to the case with multiple active iSCSI gateways
> which
> > doesn't currently distribute PGRs to all gateways in the group.
> >
> > > In my scenario, if my assumptions are correct, you would only have 1
> > > ISCSI gateway available through 2 target portal IP (for data path
> > > redundancy).  If this first ISCSI gateway fails, both target portal IP
> > > failover to the standby node with the PGR data that is available on
> share
> > stored.
> > >
> > >
> > > Sam
> > >
> > > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > > wrote:
> > >>
> > >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> > >>  wrote:
> > >> > Hi to all,
> > >> >
> > >> > What if you're using an ISCSI gateway based on LIO and KRBD (that
> > >> > is, RBD block device mounted on the ISCSI gateway and published
> > >> > through LIO).
> > >> > The
> > >> > LIO target portal (virtual IP) would failover to another node.
> > >> > This would theoretically provide support for PGRs since LIO does
> > >> > support SPC-3.
> > >> > Granted it is not distributed and limited to 1 single node
> > >> > throughput, but this would achieve high availability required by
> > >> > some environment.
> > >>
> > >> Yes, LIO technically supports PGR but it's not distributed to other
> > >> nodes. If you have a pacemaker-initiated target failover to another
> > >> node, the PGR state would be lost / missing after migration (unless I
> > >> am missing something like a resource agent that attempts to preserve
> > >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> > >> but the initiator cannot reach it), after it fails over to another
> > >> port the PGR data won't be available.
> > >>
> > >> > Of course, multiple target portal would be awesome since available
> > >> > throughput would be able to scale linearly, but since this isn't
> > >> > here right now, this would provide at least an alternative.
> > >>
> > >> It would definitely be great to go active/active but there are
> > >> concerns of data-corrupting edge conditions when using MPIO since it
> > >> relies on client-side failure timers that are not coordinated with
> > >> the target.
> > >>
> > >> For example, if an initiator writes to sector X down path A and there
> > >> is delay to the path A target (i.e. the target and initiator timeout
> > >> timers are not in-sync), and MPIO fails over to path B, quickly
> > >> performs the write to sector X and performs second write to sector X,
> > >> there is a possibility that eventually path A will unblock and
> >> overwrite the new value in sector X with the old value. The safe way
> > >> to handle that would require setting the initiator-side IO timeouts
> > >> to such high values as to cause higher-level subsystems to mark the
> > >> MPIO path as failed should a failure actually occur.
> > >>
> > >> The iSCSI MCS protocol would address these concerns since in theory
> > >> path B could discover that the retried IO was actually a retry, but
> > >> alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
> > >> 

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Adrian Saul

As an aside, SCST iSCSI will support ALUA and does PGRs through the use of
DLM.  We have been using that with Solaris and Hyper-V initiators for
RBD-backed storage, but still have some ongoing issues with ALUA (probably
our current config; we need to lab the later recommendations).



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 12 October 2017 5:04 AM
> To: Samuel Soulard 
> Cc: ceph-users ; Zhu Lingshan 
> Subject: Re: [ceph-users] Ceph-ISCSI
>
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
>  wrote:
> > Hmmm, If you failover the identity of the LIO configuration including
> > PGRs (I believe they are files on disk), this would work no?  Using an
> > 2 ISCSI gateways which have shared storage to store the LIO
> > configuration and PGR data.
>
> Are you referring to the Active Persist Through Power Loss (APTPL) support
> in LIO where it writes the PR metadata to "/var/target/pr/aptpl_"? I
> suppose that would work for a Pacemaker failover if you had a shared file
> system mounted between all your gateways *and* the initiator requests
> APTPL mode(?).
>
> > Also, you said another "fails over to another port", do you mean a
> > port on another ISCSI gateway?  I believe LIO with multiple target
> > portal IP on the same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways which
> doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have 1
> > ISCSI gateway available through 2 target portal IP (for data path
> > redundancy).  If this first ISCSI gateway fails, both target portal IP
> > failover to the standby node with the PGR data that is available on share
> stored.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >>  wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an ISCSI gateway based on LIO and KRBD (that
> >> > is, RBD block device mounted on the ISCSI gateway and published
> >> > through LIO).
> >> > The
> >> > LIO target portal (virtual IP) would failover to another node.
> >> > This would theoretically provide support for PGRs since LIO does
> >> > support SPC-3.
> >> > Granted it is not distributed and limited to 1 single node
> >> > throughput, but this would achieve high availability required by
> >> > some environment.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> >> but the initiator cannot reach it), after it fails over to another
> >> port the PGR data won't be available.
> >>
> >> > Of course, multiple target portal would be awesome since available
> >> > throughput would be able to scale linearly, but since this isn't
> >> > here right now, this would provide at least an alternative.
> >>
> >> It would definitely be great to go active/active but there are
> >> concerns of data-corrupting edge conditions when using MPIO since it
> >> relies on client-side failure timers that are not coordinated with
> >> the target.
> >>
> >> For example, if an initiator writes to sector X down path A and there
> >> is delay to the path A target (i.e. the target and initiator timeout
> >> timers are not in-sync), and MPIO fails over to path B, quickly
> >> performs the write to sector X and performs second write to sector X,
> >> there is a possibility that eventually path A will unblock and
> >> overwrite the new value in sector X with the old value. The safe way
> >> to handle that would require setting the initiator-side IO timeouts
> >> to such high values as to cause higher-level subsystems to mark the
> >> MPIO path as failed should a failure actually occur.
> >>
> >> The iSCSI MCS protocol would address these concerns since in theory
> >> path B could discover that the retried IO was actually a retry, but
> >> alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
> >> initiators.
> >>
> >> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp 
> >> > wrote:
> >> >>
> >> >> Hi Jason,
> >> >>
> >> >> Thanks for the detailed write-up...
> >> >>
> >> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
> >> >>
> >> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
> >> >> > 
> >> >> > wrote:
> >> >> >
> >> >> > > As far as I am able to understand there are 2 ways of setting
> >> >> > > iscsi for ceph
> >> >> > >
> >> >> > > 1- using kernel (lrbd) only 

Re: [ceph-users] assertion error trying to start mds server

2017-10-11 Thread Bill Sharer
I was wondering, if I can't get the second mds back up, whether that
offline backward scrub check should be able to also salvage what it can of
the two pools to a normal filesystem. Is there an option for that, or has
someone written some form of salvage tool?
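
(For reference, the offline tooling I tried boils down to roughly the
following; a sketch only, assuming the default cephfs_data pool name and
that all MDS daemons are stopped first. As far as I can tell it rebuilds
metadata in place rather than dumping anything out to a normal filesystem,
which is why I'm asking about a salvage/export option:)

cephfs-journal-tool journal export backup.bin        # keep a copy first
cephfs-journal-tool event recover_dentries summary   # salvage dentries still in the journal
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
cephfs-data-scan scan_extents cephfs_data            # offline backward scrub, pass 1
cephfs-data-scan scan_inodes cephfs_data             # pass 2: rebuild metadata from data objects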

On 10/11/2017 07:07 AM, John Spray wrote:
> On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer  wrote:
>> I've been in the process of updating my gentoo based cluster both with
>> new hardware and a somewhat postponed update.  This includes some major
>> stuff including the switch from gcc 4.x to 5.4.0 on existing hardware
>> and using gcc 6.4.0 to make better use of AMD Ryzen on the new
>> hardware.  The existing cluster was on 10.2.2, but I was going to
>> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
>> transitioning to bluestore on the osd's.
>>
>> The Ryzen units are slated to be bluestore based OSD servers if and when
>> I get to that point.  Up until the mds failure, they were simply cephfs
>> clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
>> MON) and had two servers left to update.  Both of these are also MONs
>> and were acting as a pair of dual active MDS servers running 10.2.2.
>> Monday morning I found out the hard way that an UPS one of them was on
>> has a dead battery.  After I fsck'd and came back up, I saw the
>> following assertion error when it was trying to start it's mds.B server:
>>
>>
>>  mdsbeacon(64162/B up:replay seq 3 v4699) v7  126+0+0 (709014160
>> 0 0) 0x7f6fb4001bc0 con 0x55f94779d
>> 8d0
>>  0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In
>> function 'virtual void EImportStart::r
>> eplay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
>> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x82) [0x55f93d64a122]
>>  2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
>>  3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
>>  4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
>>  5: (()+0x74a4) [0x7f6fd009b4a4]
>>  6: (clone()+0x6d) [0x7f6fce5a598d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> --- logging levels ---
>>0/ 5 none
>>0/ 1 lockdep
>>0/ 1 context
>>1/ 1 crush
>>1/ 5 mds
>>1/ 5 mds_balancer
>>1/ 5 mds_locker
>>1/ 5 mds_log
>>1/ 5 mds_log_expire
>>1/ 5 mds_migrator
>>0/ 1 buffer
>>0/ 1 timer
>>0/ 1 filer
>>0/ 1 striper
>>0/ 1 objecter
>>0/ 5 rados
>>0/ 5 rbd
>>0/ 5 rbd_mirror
>>0/ 5 rbd_replay
>>0/ 5 journaler
>>0/ 5 objectcacher
>>0/ 5 client
>>0/ 5 osd
>>0/ 5 optracker
>>0/ 5 objclass
>>1/ 3 filestore
>>1/ 3 journal
>>0/ 5 ms
>>1/ 5 mon
>>0/10 monc
>>1/ 5 paxos
>>0/ 5 tp
>>1/ 5 auth
>>1/ 5 crypto
>>1/ 1 finisher
>>1/ 5 heartbeatmap
>>1/ 5 perfcounter
>>1/ 5 rgw
>>1/10 civetweb
>>1/ 5 javaclient
>>1/ 5 asok
>>1/ 1 throttle
>>0/ 0 refs
>>1/ 5 xio
>>1/ 5 compressor
>>1/ 5 newstore
>>1/ 5 bluestore
>>1/ 5 bluefs
>>1/ 3 bdev
>>1/ 5 kstore
>>4/ 5 rocksdb
>>4/ 5 leveldb
>>1/ 5 kinetic
>>1/ 5 fuse
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent 1
>>   max_new 1000
>>   log_file /var/log/ceph/ceph-mds.B.log
>>
>>
>>
>> When I was googling around, I ran into this Cern presentation and tried
>> out the offline backware scrubbing commands on slide 25 first:
>>
>> https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf
>>
>>
>> Both ran without any messages, so I'm assuming I have sane contents in
>> the cephfs_data and cephfs_metadata pools.  Still no luck getting things
>> restarted, so I tried the cephfs-journal-tool journal reset on slide
>> 23.  That didn't work either.  Just for giggles, I tried setting up the
>> two Ryzen boxes as new mds.C and mds.D servers which would run on
>> 10.2.7-r1 instead of using mds.A and mds.B (10.2.2).  The D server fails
>> with the same assert as follows:
>
> Because this system was running multiple active MDSs on Jewel (based
> on seeing an EImportStart journal entry), and that was known to be
> unstable, I would advise you to blow away the filesystem and create a
> fresh one using luminous (where multi-mds is stable), rather than
> trying to debug it.  Going back to try and work out what went wrong
> with Jewel code is probably not a very valuable activity unless you
> have irreplacable data.
>
> If you do want to get this filesystem back on its feet in-place:
> (first stopping all MDSs) I'm guessing that your cephfs-journal-tool
> reset didn't help because you had multiple MDS ranks, and that tool
> just operates on rank 0 by 

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
or this:

   {
"shard_id": 22,
"entries": [
{
"id": "1_1507761448.758184_10459.1",
"section": "data",
"name":
"testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.3/Wireshark-win64-2.2.7.exe",
"timestamp": "2017-10-11 22:37:28.758184Z",
"info": {
"source_zone": "6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
"error_code": 5,
"message": "failed to sync object"
}
}
]
},




On Thu, Oct 12, 2017 at 12:39 AM, Enrico Kern 
wrote:

> its 45MB, but it happens with all multipart uploads.
>
> sync error list shows
>
>{
> "shard_id": 31,
> "entries": [
> {
> "id": "1_1507761459.607008_8197.1",
> "section": "data",
> "name": "testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.
> 21344646.3",
> "timestamp": "2017-10-11 22:37:39.607008Z",
> "info": {
> "source_zone": "6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
> "error_code": 5,
> "message": "failed to sync bucket instance: (5)
> Input/output error"
> }
> }
> ]
> }
>
> for multiple shards not just this one
>
>
>
> On Thu, Oct 12, 2017 at 12:31 AM, Yehuda Sadeh-Weinraub  > wrote:
>
>> What is the size of the object? Is it only this one?
>>
>> Try this command: 'radosgw-admin sync error list'. Does it show anything
>> related to that object?
>>
>> Thanks,
>> Yehuda
>>
>>
>> On Wed, Oct 11, 2017 at 3:26 PM, Enrico Kern > > wrote:
>>
>>> if i change permissions the sync status shows that it is syncing 1
>>> shard, but no files ends up in the pool (testing with empty data pool).
>>> after a while it shows that data is back in sync but there is no file
>>>
>>> On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub <
>>> yeh...@redhat.com> wrote:
>>>
 Thanks for your report. We're looking into it. You can try to see if
 touching the object (e.g., modifying its permissions) triggers the sync.

 Yehuda

 On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern <
 enrico.k...@glispamedia.com> wrote:

> Hi David,
>
> yeah seems you are right, they are stored as different filenames in
> the data bucket when using multisite upload. But anyway it stil doesnt get
> replicated. As example i have files like
>
> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_W
> ireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>
> in the data pool on one zone. But its not replicated to the other
> zone. naming is not relevant, the other data bucket doesnt have any file
> multipart or not.
>
> im really missing the file on the other zone.
>
>
> 
>  Virenfrei.
> www.avg.com
> 
> <#m_-3657272773285512991_m_-7857162703559898269_m_-2933057183600238029_m_-2032373670326744902_m_1117069282601502036_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>
> On Wed, Oct 11, 2017 at 10:25 PM, David Turner 
> wrote:
>
>> Multipart is a client side setting when uploading.  Multisite in and
>> of itself is a client and it doesn't use multipart (at least not by
>> default).  I have a Jewel RGW Multisite cluster and one site has the 
>> object
>> as multi-part while the second site just has it as a single object.  I 
>> had
>> to change from looking at the objects in the pool for monitoring to 
>> looking
>> at an ls of the buckets to see if they were in sync.
>>
>> I don't know if multisite has the option to match if an object is
>> multipart between sites, but it definitely doesn't seem to be the default
>> behavior.
>>
>> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <
>> enrico.k...@glispamedia.com> wrote:
>>
>>> Hi all,
>>>
>>> i just setup multisite replication according to the docs from
>>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
>>> works except that if a client uploads via multipart the files dont get
>>> replicated.
>>>
>>> If i in one zone rename a file that was uploaded via multipart it
>>> gets replicated, but not if i left it untouched. Any ideas why? I 
>>> remember
>>> there was a 

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
It's 45MB, but it happens with all multipart uploads.

sync error list shows

   {
"shard_id": 31,
"entries": [
{
"id": "1_1507761459.607008_8197.1",
"section": "data",
"name":
"testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.3",
"timestamp": "2017-10-11 22:37:39.607008Z",
"info": {
"source_zone": "6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
"error_code": 5,
"message": "failed to sync bucket instance: (5)
Input/output error"
}
}
]
}

for multiple shards not just this one



On Thu, Oct 12, 2017 at 12:31 AM, Yehuda Sadeh-Weinraub 
wrote:

> What is the size of the object? Is it only this one?
>
> Try this command: 'radosgw-admin sync error list'. Does it show anything
> related to that object?
>
> Thanks,
> Yehuda
>
>
> On Wed, Oct 11, 2017 at 3:26 PM, Enrico Kern 
> wrote:
>
>> if i change permissions the sync status shows that it is syncing 1 shard,
>> but no files ends up in the pool (testing with empty data pool). after a
>> while it shows that data is back in sync but there is no file
>>
>> On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub <
>> yeh...@redhat.com> wrote:
>>
>>> Thanks for your report. We're looking into it. You can try to see if
>>> touching the object (e.g., modifying its permissions) triggers the sync.
>>>
>>> Yehuda
>>>
>>> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern <
>>> enrico.k...@glispamedia.com> wrote:
>>>
 Hi David,

 yeah seems you are right, they are stored as different filenames in the
 data bucket when using multisite upload. But anyway it stil doesnt get
 replicated. As example i have files like

 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_W
 ireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6

 in the data pool on one zone. But its not replicated to the other zone.
 naming is not relevant, the other data bucket doesnt have any file
 multipart or not.

 im really missing the file on the other zone.


 
  Virenfrei.
 www.avg.com
 
 <#m_-7857162703559898269_m_-2933057183600238029_m_-2032373670326744902_m_1117069282601502036_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>

 On Wed, Oct 11, 2017 at 10:25 PM, David Turner 
 wrote:

> Multipart is a client side setting when uploading.  Multisite in and
> of itself is a client and it doesn't use multipart (at least not by
> default).  I have a Jewel RGW Multisite cluster and one site has the 
> object
> as multi-part while the second site just has it as a single object.  I had
> to change from looking at the objects in the pool for monitoring to 
> looking
> at an ls of the buckets to see if they were in sync.
>
> I don't know if multisite has the option to match if an object is
> multipart between sites, but it definitely doesn't seem to be the default
> behavior.
>
> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <
> enrico.k...@glispamedia.com> wrote:
>
>> Hi all,
>>
>> i just setup multisite replication according to the docs from
>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
>> works except that if a client uploads via multipart the files dont get
>> replicated.
>>
>> If i in one zone rename a file that was uploaded via multipart it
>> gets replicated, but not if i left it untouched. Any ideas why? I 
>> remember
>> there was a similar bug with jewel a while back.
>>
>> On the slave node i also permanently get this error (unrelated to the
>> replication) in the radosgw log:
>>
>>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
>> directory
>>
>> we didnt run radosgw before the luminous upgrade of our clusters.
>>
>> after a finished multipart upload which is only visible at one zone
>> "radosgw-admin sync status" just shows that metadata and data is caught 
>> up
>> with the source.
>>
>>
>>
>> --
>>
>> Enrico Kern
>> *Lead System Engineer*
>>
>> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
>> 26814501 <+49%201522%206814501>
>> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
>> Profile 
>>
>>
>>
>> *Glispa GmbH* - Berlin Office
>> Sonnenburger Str. 73 10437 Berlin, Germany
>> 
>>
>> Managing 

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Yehuda Sadeh-Weinraub
What is the size of the object? Is it only this one?

Try this command: 'radosgw-admin sync error list'. Does it show anything
related to that object?

Thanks,
Yehuda


On Wed, Oct 11, 2017 at 3:26 PM, Enrico Kern 
wrote:

> if i change permissions the sync status shows that it is syncing 1 shard,
> but no files ends up in the pool (testing with empty data pool). after a
> while it shows that data is back in sync but there is no file
>
> On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub  > wrote:
>
>> Thanks for your report. We're looking into it. You can try to see if
>> touching the object (e.g., modifying its permissions) triggers the sync.
>>
>> Yehuda
>>
>> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern > > wrote:
>>
>>> Hi David,
>>>
>>> yeah seems you are right, they are stored as different filenames in the
>>> data bucket when using multisite upload. But anyway it stil doesnt get
>>> replicated. As example i have files like
>>>
>>> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_W
>>> ireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>>>
>>> in the data pool on one zone. But its not replicated to the other zone.
>>> naming is not relevant, the other data bucket doesnt have any file
>>> multipart or not.
>>>
>>> im really missing the file on the other zone.
>>>
>>>
>>> 
>>>  Virenfrei.
>>> www.avg.com
>>> 
>>> <#m_-2933057183600238029_m_-2032373670326744902_m_1117069282601502036_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>>>
>>> On Wed, Oct 11, 2017 at 10:25 PM, David Turner 
>>> wrote:
>>>
 Multipart is a client side setting when uploading.  Multisite in and of
 itself is a client and it doesn't use multipart (at least not by default).
 I have a Jewel RGW Multisite cluster and one site has the object as
 multi-part while the second site just has it as a single object.  I had to
 change from looking at the objects in the pool for monitoring to looking at
 an ls of the buckets to see if they were in sync.

 I don't know if multisite has the option to match if an object is
 multipart between sites, but it definitely doesn't seem to be the default
 behavior.

 On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <
 enrico.k...@glispamedia.com> wrote:

> Hi all,
>
> i just setup multisite replication according to the docs from
> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
> works except that if a client uploads via multipart the files dont get
> replicated.
>
> If i in one zone rename a file that was uploaded via multipart it gets
> replicated, but not if i left it untouched. Any ideas why? I remember 
> there
> was a similar bug with jewel a while back.
>
> On the slave node i also permanently get this error (unrelated to the
> replication) in the radosgw log:
>
>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
> directory
>
> we didnt run radosgw before the luminous upgrade of our clusters.
>
> after a finished multipart upload which is only visible at one zone
> "radosgw-admin sync status" just shows that metadata and data is caught up
> with the source.
>
>
>
> --
>
> Enrico Kern
> *Lead System Engineer*
>
> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
> 26814501 <+49%201522%206814501>
> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
> Profile 
>
>
>
> *Glispa GmbH* - Berlin Office
> Sonnenburger Str. 73 10437 Berlin, Germany
> 
>
> Managing Director: David Brown, Registered in Berlin, AG
> Charlottenburg HRB 114678B
>    
> 
> 
>
> 
> 
>
>
> 
>  Virenfrei.
> www.avg.com
> 
> <#m_-2933057183600238029_m_-2032373670326744902_m_1117069282601502036_m_-8782081303072922711_m_-7647428735749890284_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
> ___
> ceph-users mailing list
> 

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
In addition, I noticed that if you delete a bucket that contained
multipart-upload files which were not replicated, those files are not
deleted from the pool: while the bucket is gone, the data still remains in
the pool where the multipart upload was initiated.
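
(If it helps, those leftover objects can probably be tracked down with the
orphan-search tooling; a sketch, where the pool name and job id below are
just placeholders:)

radosgw-admin orphans find --pool=zone1.rgw.buckets.data --job-id=orphan-scan1
radosgw-admin orphans list-jobs
radosgw-admin orphans finish --job-id=orphan-scan1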

On Thu, Oct 12, 2017 at 12:26 AM, Enrico Kern 
wrote:

> if i change permissions the sync status shows that it is syncing 1 shard,
> but no files ends up in the pool (testing with empty data pool). after a
> while it shows that data is back in sync but there is no file
>
> On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub  > wrote:
>
>> Thanks for your report. We're looking into it. You can try to see if
>> touching the object (e.g., modifying its permissions) triggers the sync.
>>
>> Yehuda
>>
>> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern > > wrote:
>>
>>> Hi David,
>>>
>>> yeah seems you are right, they are stored as different filenames in the
>>> data bucket when using multisite upload. But anyway it stil doesnt get
>>> replicated. As example i have files like
>>>
>>> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_W
>>> ireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>>>
>>> in the data pool on one zone. But its not replicated to the other zone.
>>> naming is not relevant, the other data bucket doesnt have any file
>>> multipart or not.
>>>
>>> im really missing the file on the other zone.
>>>
>>>
>>> 
>>>  Virenfrei.
>>> www.avg.com
>>> 
>>> <#m_-9020102307170313134_m_-2032373670326744902_m_1117069282601502036_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>>>
>>> On Wed, Oct 11, 2017 at 10:25 PM, David Turner 
>>> wrote:
>>>
 Multipart is a client side setting when uploading.  Multisite in and of
 itself is a client and it doesn't use multipart (at least not by default).
 I have a Jewel RGW Multisite cluster and one site has the object as
 multi-part while the second site just has it as a single object.  I had to
 change from looking at the objects in the pool for monitoring to looking at
 an ls of the buckets to see if they were in sync.

 I don't know if multisite has the option to match if an object is
 multipart between sites, but it definitely doesn't seem to be the default
 behavior.

 On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <
 enrico.k...@glispamedia.com> wrote:

> Hi all,
>
> i just setup multisite replication according to the docs from
> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
> works except that if a client uploads via multipart the files dont get
> replicated.
>
> If i in one zone rename a file that was uploaded via multipart it gets
> replicated, but not if i left it untouched. Any ideas why? I remember 
> there
> was a similar bug with jewel a while back.
>
> On the slave node i also permanently get this error (unrelated to the
> replication) in the radosgw log:
>
>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
> directory
>
> we didnt run radosgw before the luminous upgrade of our clusters.
>
> after a finished multipart upload which is only visible at one zone
> "radosgw-admin sync status" just shows that metadata and data is caught up
> with the source.
>
>
>
> --
>
> Enrico Kern
> *Lead System Engineer*
>
> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
> 26814501 <+49%201522%206814501>
> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
> Profile 
>
>
>
> *Glispa GmbH* - Berlin Office
> Sonnenburger Str. 73 10437 Berlin, Germany
> 
>
> Managing Director: David Brown, Registered in Berlin, AG
> Charlottenburg HRB 114678B
>    
> 
> 
>
> 
> 
>
>
> 
>  Virenfrei.
> www.avg.com
> 
> <#m_-9020102307170313134_m_-2032373670326744902_m_1117069282601502036_m_-8782081303072922711_m_-7647428735749890284_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
> 

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
If I change permissions, the sync status shows that it is syncing 1 shard,
but no files end up in the pool (testing with an empty data pool). After a
while it shows that data is back in sync, but there is no file.

On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub 
wrote:

> Thanks for your report. We're looking into it. You can try to see if
> touching the object (e.g., modifying its permissions) triggers the sync.
>
> Yehuda
>
> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern 
> wrote:
>
>> Hi David,
>>
>> yeah seems you are right, they are stored as different filenames in the
>> data bucket when using multisite upload. But anyway it stil doesnt get
>> replicated. As example i have files like
>>
>> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_
>> Wireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>>
>> in the data pool on one zone. But its not replicated to the other zone.
>> naming is not relevant, the other data bucket doesnt have any file
>> multipart or not.
>>
>> im really missing the file on the other zone.
>>
>>
>> 
>>  Virenfrei.
>> www.avg.com
>> 
>> <#m_-2032373670326744902_m_1117069282601502036_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>>
>> On Wed, Oct 11, 2017 at 10:25 PM, David Turner 
>> wrote:
>>
>>> Multipart is a client side setting when uploading.  Multisite in and of
>>> itself is a client and it doesn't use multipart (at least not by default).
>>> I have a Jewel RGW Multisite cluster and one site has the object as
>>> multi-part while the second site just has it as a single object.  I had to
>>> change from looking at the objects in the pool for monitoring to looking at
>>> an ls of the buckets to see if they were in sync.
>>>
>>> I don't know if multisite has the option to match if an object is
>>> multipart between sites, but it definitely doesn't seem to be the default
>>> behavior.
>>>
>>> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern 
>>> wrote:
>>>
 Hi all,

 i just setup multisite replication according to the docs from
 http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
 works except that if a client uploads via multipart the files dont get
 replicated.

 If i in one zone rename a file that was uploaded via multipart it gets
 replicated, but not if i left it untouched. Any ideas why? I remember there
 was a similar bug with jewel a while back.

 On the slave node i also permanently get this error (unrelated to the
 replication) in the radosgw log:

  meta sync: ERROR: failed to read mdlog info with (2) No such file or
 directory

 we didnt run radosgw before the luminous upgrade of our clusters.

 after a finished multipart upload which is only visible at one zone
 "radosgw-admin sync status" just shows that metadata and data is caught up
 with the source.



 --

 Enrico Kern
 *Lead System Engineer*

 *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
 26814501 <+49%201522%206814501>
 *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
 Profile 



 *Glispa GmbH* - Berlin Office
 Sonnenburger Str. 73 10437 Berlin, Germany
 

 Managing Director: David Brown, Registered in Berlin, AG
 Charlottenburg HRB 114678B
    
 
 

 
 


 
  Virenfrei.
 www.avg.com
 
 <#m_-2032373670326744902_m_1117069282601502036_m_-8782081303072922711_m_-7647428735749890284_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
>>
>> --
>>
>> Enrico Kern
>> *Lead System Engineer*
>>
>> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
>> 26814501 <+49%201522%206814501>
>> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
>> Profile 
>>
>>
>>
>> *Glispa GmbH* - Berlin Office
>> Sonnenburger Str. 73 

Re: [ceph-users] RGW flush_read_list error

2017-10-11 Thread Travis Nielsen
To the client they were showing up as a 500 error. Ty, do you know of any
client-side issues that could have come up during the test run? And there
was only a single GET happening at a time, right?
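
(For reproduction purposes the access pattern is just a single sequential
GET in a loop; a sketch with s3cmd below, although the actual run used the
aws-sdk-java client shown in the log, and the bucket/object names are taken
from that log:)

while true; do s3cmd get --force s3://bucket100/testfile.tst /tmp/testfile.tst; done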




On 10/11/17, 9:27 AM, "ceph-users on behalf of Casey Bodley"
 wrote:

>Hi Travis,
>
>This is reporting an error when sending data back to the client.
>Generally it means that the client timed out and closed the connection.
>Are you also seeing failures on the client side?
>
>Casey
>
>
>On 10/10/2017 06:45 PM, Travis Nielsen wrote:
>> In Luminous 12.2.1, when running a GET on a large (1GB file) repeatedly
>> for an hour from RGW, the following error was hit intermittently a
>>number
>> of times. The first error was hit after 45 minutes and then the error
>> happened frequently for the remainder of the test.
>>
>> ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
>>
>> Here is some more context from the rgw log around one of the failures.
>>
>> 2017-10-10 18:20:32.321681 I | rgw: 2017-10-10 18:20:32.321643
>> 7f8929f41700 1 civetweb: 0x55bd25899000: 10.32.0.1 - -
>> [10/Oct/2017:18:19:07 +] "GET /bucket100/testfile.tst HTTP/1.1" 1 0
>>-
>> aws-sdk-java/1.9.0 Linux/4.4.0-93-generic
>> OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131
>> 2017-10-10 18:20:32.383855 I | rgw: 2017-10-10 18:20:32.383786
>> 7f8924736700 1 == starting new request req=0x7f892472f140 =
>> 2017-10-10 18:20:46.605668 I | rgw: 2017-10-10 18:20:46.605576
>> 7f894af83700 0 ERROR: flush_read_list(): d->client_cb->handle_data()
>> returned -5
>> 2017-10-10 18:20:46.605934 I | rgw: 2017-10-10 18:20:46.605914
>> 7f894af83700 1 == req done req=0x7f894af7c140 op status=-5
>> http_status=200 ==
>> 2017-10-10 18:20:46.606249 I | rgw: 2017-10-10 18:20:46.606225
>> 7f8924736700 0 ERROR: flush_read_list(): d->client_cb->handle_data()
>> returned -5
>>
>> I don't see anything else standing out in the log. The object store was
>> configured with an erasure-coded data pool with k=2 and m=1.
>>
>> There are a number of threads around this, but I don't see a resolution.
>> Is there a tracking issue for this?
>> 
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007756.html
>> 
>> https://www.spinics.net/lists/ceph-users/msg16117.html
>> 
>> https://www.spinics.net/lists/ceph-devel/msg37657.html
>>
>>
>> Here's our tracking Rook issue.
>> 
>> https://github.com/rook/rook/issues/1067
>>
>>
>> Thanks,
>> Travis
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Yehuda Sadeh-Weinraub
Thanks for your report. We're looking into it. You can try to see if
touching the object (e.g., modifying its permissions) triggers the sync.

Yehuda

On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern 
wrote:

> Hi David,
>
> yeah seems you are right, they are stored as different filenames in the
> data bucket when using multisite upload. But anyway it stil doesnt get
> replicated. As example i have files like
>
> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__
> multipart_Wireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>
> in the data pool on one zone. But its not replicated to the other zone.
> naming is not relevant, the other data bucket doesnt have any file
> multipart or not.
>
> im really missing the file on the other zone.
>
>
> 
>  Virenfrei.
> www.avg.com
> 
> <#m_1117069282601502036_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>
> On Wed, Oct 11, 2017 at 10:25 PM, David Turner 
> wrote:
>
>> Multipart is a client side setting when uploading.  Multisite in and of
>> itself is a client and it doesn't use multipart (at least not by default).
>> I have a Jewel RGW Multisite cluster and one site has the object as
>> multi-part while the second site just has it as a single object.  I had to
>> change from looking at the objects in the pool for monitoring to looking at
>> an ls of the buckets to see if they were in sync.
>>
>> I don't know if multisite has the option to match if an object is
>> multipart between sites, but it definitely doesn't seem to be the default
>> behavior.
>>
>> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern 
>> wrote:
>>
>>> Hi all,
>>>
>>> i just setup multisite replication according to the docs from
>>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
>>> works except that if a client uploads via multipart the files dont get
>>> replicated.
>>>
>>> If i in one zone rename a file that was uploaded via multipart it gets
>>> replicated, but not if i left it untouched. Any ideas why? I remember there
>>> was a similar bug with jewel a while back.
>>>
>>> On the slave node i also permanently get this error (unrelated to the
>>> replication) in the radosgw log:
>>>
>>>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
>>> directory
>>>
>>> we didnt run radosgw before the luminous upgrade of our clusters.
>>>
>>> after a finished multipart upload which is only visible at one zone
>>> "radosgw-admin sync status" just shows that metadata and data is caught up
>>> with the source.
>>>
>>>
>>>
>>> --
>>>
>>> Enrico Kern
>>> *Lead System Engineer*
>>>
>>> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
>>> 26814501 <+49%201522%206814501>
>>> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
>>> Profile 
>>>
>>>
>>>
>>> *Glispa GmbH* - Berlin Office
>>> Sonnenburger Str. 73 10437 Berlin, Germany
>>> 
>>>
>>> Managing Director: David Brown, Registered in Berlin, AG Charlottenburg
>>> HRB 114678B
>>>    
>>> 
>>> 
>>>
>>> 
>>> 
>>>
>>>
>>> 
>>>  Virenfrei.
>>> www.avg.com
>>> 
>>> <#m_1117069282601502036_m_-8782081303072922711_m_-7647428735749890284_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
> --
>
> Enrico Kern
> *Lead System Engineer*
>
> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152 26814501
> <+49%201522%206814501>
> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my Profile
> 
>
>
>
> *Glispa GmbH* - Berlin Office
> Sonnenburger Str. 73 10437 Berlin, Germany
> Managing Director: Dina Karol-Gavish, Registered in Berlin, AG
> Charlottenburg HRB 114678B
>    
> 
> 
>   
>
>
> 

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
Hi David,

Yeah, seems you are right; they are stored as different filenames in the
data bucket when using multisite upload. But anyway it still doesn't get
replicated. As an example I have files like

6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_Wireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6

in the data pool in one zone. But it's not replicated to the other zone.
Naming is not relevant; the other data bucket doesn't have any file,
multipart or not.

I'm really missing the file on the other zone.



On Wed, Oct 11, 2017 at 10:25 PM, David Turner 
wrote:

> Multipart is a client side setting when uploading.  Multisite in and of
> itself is a client and it doesn't use multipart (at least not by default).
> I have a Jewel RGW Multisite cluster and one site has the object as
> multi-part while the second site just has it as a single object.  I had to
> change from looking at the objects in the pool for monitoring to looking at
> an ls of the buckets to see if they were in sync.
>
> I don't know if multisite has the option to match if an object is
> multipart between sites, but it definitely doesn't seem to be the default
> behavior.
>
> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern 
> wrote:
>
>> Hi all,
>>
>> i just setup multisite replication according to the docs from
>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything works
>> except that if a client uploads via multipart the files dont get replicated.
>>
>> If i in one zone rename a file that was uploaded via multipart it gets
>> replicated, but not if i left it untouched. Any ideas why? I remember there
>> was a similar bug with jewel a while back.
>>
>> On the slave node i also permanently get this error (unrelated to the
>> replication) in the radosgw log:
>>
>>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
>> directory
>>
>> we didnt run radosgw before the luminous upgrade of our clusters.
>>
>> after a finished multipart upload which is only visible at one zone
>> "radosgw-admin sync status" just shows that metadata and data is caught up
>> with the source.
>>
>>
>>
>> --
>>
>> Enrico Kern
>> *Lead System Engineer*
>>
>> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
>> 26814501 <+49%201522%206814501>
>> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
>> Profile 
>>
>>
>>
>> *Glispa GmbH* - Berlin Office
>> Sonnenburger Str. 73 10437 Berlin, Germany
>> 
>>
>> Managing Director: David Brown, Registered in Berlin, AG Charlottenburg
>> HRB 114678B
>>    
>> 
>> 
>>   
>>
>>
>>
>> 
>>  Virenfrei.
>> www.avg.com
>> 
>> <#m_-8782081303072922711_m_-7647428735749890284_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 

Enrico Kern
*Lead System Engineer*

*T* +49 (0) 30 555713017  | *M *+49 (0)152 26814501
*E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my Profile




*Glispa GmbH* - Berlin Office
Sonnenburger Str. 73 10437 Berlin, Germany
Managing Director: Dina Karol-Gavish, Registered in Berlin, AG
Charlottenburg HRB 114678B
   


      

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
Hi all,

I just set up multisite replication according to the docs at
http://docs.ceph.com/docs/master/radosgw/multisite/ and everything works,
except that if a client uploads via multipart the files don't get
replicated.

If I rename a file in one zone that was uploaded via multipart, it gets
replicated, but not if I leave it untouched. Any ideas why? I remember
there was a similar bug with Jewel a while back.

On the slave node I also permanently get this error (unrelated to the
replication) in the radosgw log:

 meta sync: ERROR: failed to read mdlog info with (2) No such file or
directory

We didn't run radosgw before the Luminous upgrade of our clusters.

After a finished multipart upload, which is only visible in one zone,
"radosgw-admin sync status" just shows that metadata and data are caught up
with the source.
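
(For reference, this is roughly how I am checking the sync state; a sketch,
where the zone and bucket names are placeholders:)

radosgw-admin sync status
radosgw-admin data sync status --source-zone=zone1
radosgw-admin bucket sync status --bucket=testbucket
radosgw-admin sync error list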



-- 

Enrico Kern
*Lead System Engineer*

*T* +49 (0) 30 555713017  | *M *+49 (0)152 26814501
*E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my Profile




*Glispa GmbH* - Berlin Office
Sonnenburger Str. 73 10437 Berlin, Germany
Managing Director: David Brown, Registered in Berlin, AG Charlottenburg HRB
114678B
   


      



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Samuel Soulard
Ahh, so in this case only SUSE Enterprise Storage is able to provide iSCSI
connections for MS clusters if HA is required, be it Active/Standby,
Active/Active or Active/Failover.

On Wed, Oct 11, 2017 at 2:03 PM, Jason Dillaman  wrote:

> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
>  wrote:
> > Hmmm, If you failover the identity of the LIO configuration including
> PGRs
> > (I believe they are files on disk), this would work no?  Using an 2 ISCSI
> > gateways which have shared storage to store the LIO configuration and PGR
> > data.
>
> Are you referring to the Active Persist Through Power Loss (APTPL)
> support in LIO where it writes the PR metadata to
> "/var/target/pr/aptpl_"? I suppose that would work for a
> Pacemaker failover if you had a shared file system mounted between all
> your gateways *and* the initiator requests APTPL mode(?).
>
> > Also, you said another "fails over to another port", do you mean a port
> on
> > another ISCSI gateway?  I believe LIO with multiple target portal IP on
> the
> > same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways
> which doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have 1
> ISCSI
> > gateway available through 2 target portal IP (for data path
> redundancy).  If
> > this first ISCSI gateway fails, both target portal IP failover to the
> > standby node with the PGR data that is available on share stored.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >>  wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an ISCSI gateway based on LIO and KRBD (that is,
> >> > RBD
> >> > block device mounted on the ISCSI gateway and published through LIO).
> >> > The
> >> > LIO target portal (virtual IP) would failover to another node.  This
> >> > would
> >> > theoretically provide support for PGRs since LIO does support SPC-3.
> >> > Granted it is not distributed and limited to 1 single node throughput,
> >> > but
> >> > this would achieve high availability required by some environment.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> >> but the initiator cannot reach it), after it fails over to another
> >> port the PGR data won't be available.
> >>
> >> > Of course, multiple target portal would be awesome since available
> >> > throughput would be able to scale linearly, but since this isn't here
> >> > right
> >> > now, this would provide at least an alternative.
> >>
> >> It would definitely be great to go active/active but there are
> >> concerns of data-corrupting edge conditions when using MPIO since it
> >> relies on client-side failure timers that are not coordinated with the
> >> target.
> >>
> >> For example, if an initiator writes to sector X down path A and there
> >> is delay to the path A target (i.e. the target and initiator timeout
> >> timers are not in-sync), and MPIO fails over to path B, quickly
> >> performs the write to sector X and performs second write to sector X,
> >> there is a possibility that eventually path A will unblock and
> >> overwrite the new value in sector X with the old value. The safe way
> >> to handle that would require setting the initiator-side IO timeouts to
> >> such high values as to cause higher-level subsystems to mark the MPIO
> >> path as failed should a failure actually occur.
> >>
> >> The iSCSI MCS protocol would address these concerns since in theory
> >> path B could discover that the retried IO was actually a retry, but
> >> alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
> >> initiators.
> >>
> >> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp 
> >> > wrote:
> >> >>
> >> >> Hi Jason,
> >> >>
> >> >> Thanks for the detailed write-up...
> >> >>
> >> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
> >> >>
> >> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
> >> >> > 
> >> >> > wrote:
> >> >> >
> >> >> > > As far as I am able to understand there are 2 ways of setting
> iscsi
> >> >> > > for
> >> >> > > ceph
> >> >> > >
> >> >> > > 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
> >> >> > >
> >> >> >
> >> >> > The target_core_rbd approach is only utilized by SUSE (and its
> >> >> > derivatives
> >> >> > like PetaSAN) as far as I know. This was the initial approach for
> Red
> >> >> > Hat-derived kernels as well until the upstream kernel maintainers
> >> >> > 

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Jason Dillaman
On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
 wrote:
> Hmmm, if you fail over the identity of the LIO configuration including PGRs
> (I believe they are files on disk), this would work, no?  Using 2 iSCSI
> gateways which have shared storage to store the LIO configuration and PGR
> data.

Are you referring to the Active Persist Through Power Loss (APTPL)
support in LIO where it writes the PR metadata to
"/var/target/pr/aptpl_"? I suppose that would work for a
Pacemaker failover if you had a shared file system mounted between all
your gateways *and* the initiator requests APTPL mode(?).
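
(For anyone wanting to verify that behaviour end-to-end, persistent reservations can be exercised from an initiator with sg_utils; purely an illustrative sketch, the multipath device and key below are made up:)

~# sg_persist --out --register --param-sark=0xabc123 /dev/mapper/mpatha
~# sg_persist --in --read-keys /dev/mapper/mpatha
~# sg_persist --out --reserve --param-rk=0xabc123 --prout-type=5 /dev/mapper/mpatha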

> Also, you said another "fails over to another port", do you mean a port on
> another ISCSI gateway?  I believe LIO with multiple target portal IP on the
> same node for path redundancy works with PGRs.

Yes, I was referring to the case with multiple active iSCSI gateways
which doesn't currently distribute PGRs to all gateways in the group.

> In my scenario, if my assumptions are correct, you would only have 1 ISCSI
> gateway available through 2 target portal IP (for data path redundancy).  If
> this first ISCSI gateway fails, both target portal IP failover to the
> standby node with the PGR data that is available on shared storage.
>
>
> Sam
>
> On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> wrote:
>>
>> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
>>  wrote:
>> > Hi to all,
>> >
>> > What if you're using an ISCSI gateway based on LIO and KRBD (that is,
>> > RBD
>> > block device mounted on the ISCSI gateway and published through LIO).
>> > The
>> > LIO target portal (virtual IP) would failover to another node.  This
>> > would
>> > theoretically provide support for PGRs since LIO does support SPC-3.
>> > Granted it is not distributed and limited to 1 single node throughput,
>> > but
>> > this would achieve high availability required by some environments.
>>
>> Yes, LIO technically supports PGR but it's not distributed to other
>> nodes. If you have a pacemaker-initiated target failover to another
>> node, the PGR state would be lost / missing after migration (unless I
>> am missing something like a resource agent that attempts to preserve
>> the PGRs). For initiator-initiated failover (e.g. a target is alive
>> but the initiator cannot reach it), after it fails over to another
>> port the PGR data won't be available.
>>
>> > Of course, multiple target portal would be awesome since available
>> > throughput would be able to scale linearly, but since this isn't here
>> > right
>> > now, this would provide at least an alternative.
>>
>> It would definitely be great to go active/active but there are
>> concerns of data-corrupting edge conditions when using MPIO since it
>> relies on client-side failure timers that are not coordinated with the
>> target.
>>
>> For example, if an initiator writes to sector X down path A and there
>> is delay to the path A target (i.e. the target and initiator timeout
>> timers are not in-sync), and MPIO fails over to path B, quickly
>> performs the write to sector X and then performs a second write to sector X,
>> there is a possibility that eventually path A will unblock and
>> overwrite the new value in sector X with the old value. The safe way
>> to handle that would require setting the initiator-side IO timeouts to
>> such high values as to cause higher-level subsystems to mark the MPIO
>> path as failed should a failure actually occur.
>>
>> The iSCSI MCS protocol would address these concerns since in theory
>> path B could discover that the retried IO was actually a retry, but
>> alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
>> initiators.
>>
>> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp 
>> > wrote:
>> >>
>> >> Hi Jason,
>> >>
>> >> Thanks for the detailed write-up...
>> >>
>> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
>> >>
>> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
>> >> > 
>> >> > wrote:
>> >> >
>> >> > > As far as I am able to understand there are 2 ways of setting iscsi
>> >> > > for
>> >> > > ceph
>> >> > >
>> >> > > 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
>> >> > >
>> >> >
>> >> > The target_core_rbd approach is only utilized by SUSE (and its
>> >> > derivatives
>> >> > like PetaSAN) as far as I know. This was the initial approach for Red
>> >> > Hat-derived kernels as well until the upstream kernel maintainers
>> >> > indicated
>> >> > that they really do not want a specialized target backend for just
>> >> > krbd.
>> >> > The next attempt was to re-use the existing target_core_iblock to
>> >> > interface
>> >> > with krbd via the kernel's block layer, but that hit similar upstream
>> >> > walls
>> >> > trying to get support for SCSI command passthrough to the block
>> >> > layer.
>> >> >
>> >> >
>> >> > > 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
>> >> > >
>> >> >
>> >> 

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Samuel Soulard
Hmmm, if you fail over the identity of the LIO configuration including PGRs
(I believe they are files on disk), this would work, no?  Using 2 iSCSI
gateways which have shared storage to store the LIO configuration and PGR
data.

Also, you said another "fails over to another port", do you mean a port on
another ISCSI gateway?  I believe LIO with multiple target portal IP on the
same node for path redundancy works with PGRs.

In my scenario, if my assumptions are correct, you would only have 1 ISCSI
gateway available through 2 target portal IP (for data path redundancy).
If this first iSCSI gateway fails, both target portal IPs fail over to the
standby node with the PGR data that is available on shared storage.
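
(Purely as an illustration of that scenario, the floating target portal IPs could be modelled as Pacemaker resources along these lines; the resource names and addresses are made up:)

~# pcs resource create iscsi_vip1 ocf:heartbeat:IPaddr2 ip=192.168.10.11 cidr_netmask=24 op monitor interval=10s
~# pcs resource create iscsi_vip2 ocf:heartbeat:IPaddr2 ip=192.168.10.12 cidr_netmask=24 op monitor interval=10s
~# pcs constraint colocation add iscsi_vip2 with iscsi_vip1 INFINITY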


Sam

On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
wrote:

> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
>  wrote:
> > Hi to all,
> >
> > What if you're using an ISCSI gateway based on LIO and KRBD (that is, RBD
> > block device mounted on the ISCSI gateway and published through LIO).
> The
> > LIO target portal (virtual IP) would failover to another node.  This
> would
> > theoretically provide support for PGRs since LIO does support SPC-3.
> > Granted it is not distributed and limited to 1 single node throughput,
> but
> > this would achieve high availability required by some environments.
>
> Yes, LIO technically supports PGR but it's not distributed to other
> nodes. If you have a pacemaker-initiated target failover to another
> node, the PGR state would be lost / missing after migration (unless I
> am missing something like a resource agent that attempts to preserve
> the PGRs). For initiator-initiated failover (e.g. a target is alive
> but the initiator cannot reach it), after it fails over to another
> port the PGR data won't be available.
>
> > Of course, multiple target portal would be awesome since available
> > throughput would be able to scale linearly, but since this isn't here
> right
> > now, this would provide at least an alternative.
>
> It would definitely be great to go active/active but there are
> concerns of data-corrupting edge conditions when using MPIO since it
> relies on client-side failure timers that are not coordinated with the
> target.
>
> For example, if an initiator writes to sector X down path A and there
> is delay to the path A target (i.e. the target and initiator timeout
> timers are not in-sync), and MPIO fails over to path B, quickly
> performs the write to sector X and then performs a second write to sector X,
> there is a possibility that eventually path A will unblock and
> overwrite the new value in sector X with the old value. The safe way
> to handle that would require setting the initiator-side IO timeouts to
> such high values as to cause higher-level subsystems to mark the MPIO
> path as failed should a failure actually occur.
>
> The iSCSI MCS protocol would address these concerns since in theory
> path B could discover that the retried IO was actually a retry, but
> alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
> initiators.
>
> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp 
> wrote:
> >>
> >> Hi Jason,
> >>
> >> Thanks for the detailed write-up...
> >>
> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
> >>
> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López <
> jorp...@unizar.es>
> >> > wrote:
> >> >
> >> > > As far as I am able to understand there are 2 ways of setting iscsi
> >> > > for
> >> > > ceph
> >> > >
> >> > > 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
> >> > >
> >> >
> >> > The target_core_rbd approach is only utilized by SUSE (and its
> >> > derivatives
> >> > like PetaSAN) as far as I know. This was the initial approach for Red
> >> > Hat-derived kernels as well until the upstream kernel maintainers
> >> > indicated
> >> > that they really do not want a specialized target backend for just
> krbd.
> >> > The next attempt was to re-use the existing target_core_iblock to
> >> > interface
> >> > with krbd via the kernel's block layer, but that hit similar upstream
> >> > walls
> >> > trying to get support for SCSI command passthrough to the block layer.
> >> >
> >> >
> >> > > 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
> >> > >
> >> >
> >> > The TCMU approach is what upstream and Red Hat-derived kernels will
> >> > support
> >> > going forward.
> >>
> >> SUSE is also in the process of migrating to the upstream tcmu approach,
> >> for the reasons that you gave in (1).
> >>
> >> ...
> >>
> >> > The TCMU approach also does not currently support SCSI persistent
> >> > reservation groups (needed for Windows clustering) because that
> support
> >> > isn't available in the upstream kernel. The SUSE kernel has an
> approach
> >> > that utilizes two round-trips to the OSDs for each IO to simulate PGR
> >> > support. Earlier this summer I believe SUSE started to look into how
> to
> >> > get
> >> > generic PGR 

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Jason Dillaman
On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
 wrote:
> Hi to all,
>
> What if you're using an ISCSI gateway based on LIO and KRBD (that is, RBD
> block device mounted on the ISCSI gateway and published through LIO).  The
> LIO target portal (virtual IP) would failover to another node.  This would
> theoretically provide support for PGRs since LIO does support SPC-3.
> Granted it is not distributed and limited to 1 single node throughput, but
> this would achieve high availability required by some environments.

Yes, LIO technically supports PGR but it's not distributed to other
nodes. If you have a pacemaker-initiated target failover to another
node, the PGR state would be lost / missing after migration (unless I
am missing something like a resource agent that attempts to preserve
the PGRs). For initiator-initiated failover (e.g. a target is alive
but the initiator cannot reach it), after it fails over to another
port the PGR data won't be available.

> Of course, multiple target portal would be awesome since available
> throughput would be able to scale linearly, but since this isn't here right
> now, this would provide at least an alternative.

It would definitely be great to go active/active but there are
concerns of data-corrupting edge conditions when using MPIO since it
relies on client-side failure timers that are not coordinated with the
target.

For example, if an initiator writes to sector X down path A and there
is delay to the path A target (i.e. the target and initiator timeout
timers are not in-sync), and MPIO fails over to path B, quickly
performs the write to sector X and then performs a second write to sector X,
there is a possibility that eventually path A will unblock and
overwrite the new value in sector X with the old value. The safe way
to handle that would require setting the initiator-side IO timeouts to
such high values as to cause higher-level subsystems to mark the MPIO
path as failed should a failure actually occur.
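
(To illustrate that trade-off, this is roughly the kind of initiator-side knob involved; a sketch only, the device entry and values are assumptions rather than recommendations:)

# /etc/multipath.conf (illustrative)
devices {
    device {
        vendor "LIO-ORG"
        product ".*"
        path_grouping_policy "failover"
        # "queue" blocks IO indefinitely while no path is usable, pushing failure
        # handling up to very long timeouts; a small retry count fails faster but
        # re-opens the stale-write window described above
        no_path_retry queue
    }
}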

The iSCSI MCS protocol would address these concerns since in theory
path B could discover that the retried IO was actually a retry, but
alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
initiators.

> On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp  wrote:
>>
>> Hi Jason,
>>
>> Thanks for the detailed write-up...
>>
>> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
>>
>> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López 
>> > wrote:
>> >
>> > > As far as I am able to understand there are 2 ways of setting iscsi
>> > > for
>> > > ceph
>> > >
>> > > 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
>> > >
>> >
>> > The target_core_rbd approach is only utilized by SUSE (and its
>> > derivatives
>> > like PetaSAN) as far as I know. This was the initial approach for Red
>> > Hat-derived kernels as well until the upstream kernel maintainers
>> > indicated
>> > that they really do not want a specialized target backend for just krbd.
>> > The next attempt was to re-use the existing target_core_iblock to
>> > interface
>> > with krbd via the kernel's block layer, but that hit similar upstream
>> > walls
>> > trying to get support for SCSI command passthrough to the block layer.
>> >
>> >
>> > > 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
>> > >
>> >
>> > The TCMU approach is what upstream and Red Hat-derived kernels will
>> > support
>> > going forward.
>>
>> SUSE is also in the process of migrating to the upstream tcmu approach,
>> for the reasons that you gave in (1).
>>
>> ...
>>
>> > The TCMU approach also does not currently support SCSI persistent
>> > reservation groups (needed for Windows clustering) because that support
>> > isn't available in the upstream kernel. The SUSE kernel has an approach
>> > that utilizes two round-trips to the OSDs for each IO to simulate PGR
>> > support. Earlier this summer I believe SUSE started to look into how to
>> > get
>> > generic PGR support merged into the upstream kernel using corosync/dlm
>> > to
>> > synchronize the states between multiple nodes in the target. I am not
>> > sure
>> > of the current state of that work, but it would benefit all LIO targets
>> > when complete.
>>
>> Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support. IIUC,
>> whether DLM or the underlying Ceph cluster gets used for PR state
>> storage is still under consideration.
>>
>> Cheers, David
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Samuel Soulard
Hi to all,

What if you're using an ISCSI gateway based on LIO and KRBD (that is, RBD
block device mounted on the ISCSI gateway and published through LIO).  The
LIO target portal (virtual IP) would failover to another node.  This would
theoretically provide support for PGRs since LIO does support SPC-3.
Granted it is not distributed and limited to 1 single node throughput, but
this would achieve high availability required by some environments.

Of course, multiple target portals would be awesome since the available
throughput would be able to scale linearly, but since this isn't here right
now, this would provide at least an alternative.

On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp  wrote:

> Hi Jason,
>
> Thanks for the detailed write-up...
>
> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
>
> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López 
> > wrote:
> >
> > > As far as I am able to understand there are 2 ways of setting iscsi for
> > > ceph
> > >
> > > 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
> > >
> >
> > The target_core_rbd approach is only utilized by SUSE (and its
> derivatives
> > like PetaSAN) as far as I know. This was the initial approach for Red
> > Hat-derived kernels as well until the upstream kernel maintainers
> indicated
> > that they really do not want a specialized target backend for just krbd.
> > The next attempt was to re-use the existing target_core_iblock to
> interface
> > with krbd via the kernel's block layer, but that hit similar upstream
> walls
> > trying to get support for SCSI command passthrough to the block layer.
> >
> >
> > > 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
> > >
> >
> > The TCMU approach is what upstream and Red Hat-derived kernels will
> support
> > going forward.
>
> SUSE is also in the process of migrating to the upstream tcmu approach,
> for the reasons that you gave in (1).
>
> ...
>
> > The TCMU approach also does not currently support SCSI persistent
> > reservation groups (needed for Windows clustering) because that support
> > isn't available in the upstream kernel. The SUSE kernel has an approach
> > that utilizes two round-trips to the OSDs for each IO to simulate PGR
> > support. Earlier this summer I believe SUSE started to look into how to
> get
> > generic PGR support merged into the upstream kernel using corosync/dlm to
> > synchronize the states between multiple nodes in the target. I am not
> sure
> > of the current state of that work, but it would benefit all LIO targets
> > when complete.
>
> Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support. IIUC,
> whether DLM or the underlying Ceph cluster gets used for PR state
> storage is still under consideration.
>
> Cheers, David
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW flush_read_list error

2017-10-11 Thread Casey Bodley

Hi Travis,

This is reporting an error when sending data back to the client. 
Generally it means that the client timed out and closed the connection. 
Are you also seeing failures on the client side?
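
If it does turn out to be client-side timeouts, a couple of knobs may be worth experimenting with; the section name and values below are only examples, not a recommendation:

[client.rgw.gateway-1]
rgw frontends = civetweb port=7480 request_timeout_ms=65000
debug rgw = 20/20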


Casey


On 10/10/2017 06:45 PM, Travis Nielsen wrote:

In Luminous 12.2.1, when running a GET on a large (1GB file) repeatedly
for an hour from RGW, the following error was hit intermittently a number
of times. The first error was hit after 45 minutes and then the error
happened frequently for the remainder of the test.

ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

Here is some more context from the rgw log around one of the failures.

2017-10-10 18:20:32.321681 I | rgw: 2017-10-10 18:20:32.321643
7f8929f41700 1 civetweb: 0x55bd25899000: 10.32.0.1 - -
[10/Oct/2017:18:19:07 +] "GET /bucket100/testfile.tst HTTP/1.1" 1 0 -
aws-sdk-java/1.9.0 Linux/4.4.0-93-generic
OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131
2017-10-10 18:20:32.383855 I | rgw: 2017-10-10 18:20:32.383786
7f8924736700 1 == starting new request req=0x7f892472f140 =
2017-10-10 18:20:46.605668 I | rgw: 2017-10-10 18:20:46.605576
7f894af83700 0 ERROR: flush_read_list(): d->client_cb->handle_data()
returned -5
2017-10-10 18:20:46.605934 I | rgw: 2017-10-10 18:20:46.605914
7f894af83700 1 == req done req=0x7f894af7c140 op status=-5
http_status=200 ==
2017-10-10 18:20:46.606249 I | rgw: 2017-10-10 18:20:46.606225
7f8924736700 0 ERROR: flush_read_list(): d->client_cb->handle_data()
returned -5

I don't see anything else standing out in the log. The object store was
configured with an erasure-coded data pool with k=2 and m=1.

There are a number of threads around this, but I don't see a resolution.
Is there a tracking issue for this?
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007756.ht
ml
https://www.spinics.net/lists/ceph-users/msg16117.html
https://www.spinics.net/lists/ceph-devel/msg37657.html


Here's our tracking Rook issue.
https://github.com/rook/rook/issues/1067


Thanks,
Travis



On 10/10/17, 3:05 PM, "ceph-users on behalf of Jack"
 wrote:


Hi,

I would like some information about the following

Let say I have a running cluster, with 4 OSDs: 2 SSDs, and 2 HDDs
My single pool has size=3, min_size=2

For a write-only pattern, I thought I would get SSDs performance level,
because the write would be acked as soon as min_size OSDs acked

But am I right?

(the same setup could involve some high latency OSDs, in the case of
country-level cluster)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread David Disseldorp
Hi Jason,

Thanks for the detailed write-up...

On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:

> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López 
> wrote:
> 
> > As far as I am able to understand there are 2 ways of setting iscsi for
> > ceph
> >
> > 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
> >  
> 
> The target_core_rbd approach is only utilized by SUSE (and its derivatives
> like PetaSAN) as far as I know. This was the initial approach for Red
> Hat-derived kernels as well until the upstream kernel maintainers indicated
> that they really do not want a specialized target backend for just krbd.
> The next attempt was to re-use the existing target_core_iblock to interface
> with krbd via the kernel's block layer, but that hit similar upstream walls
> trying to get support for SCSI command passthrough to the block layer.
> 
> 
> > 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
> >  
> 
> The TCMU approach is what upstream and Red Hat-derived kernels will support
> going forward.

SUSE is also in the process of migrating to the upstream tcmu approach,
for the reasons that you gave in (1).

...

> The TCMU approach also does not currently support SCSI persistent
> reservation groups (needed for Windows clustering) because that support
> isn't available in the upstream kernel. The SUSE kernel has an approach
> that utilizes two round-trips to the OSDs for each IO to simulate PGR
> support. Earlier this summer I believe SUSE started to look into how to get
> generic PGR support merged into the upstream kernel using corosync/dlm to
> synchronize the states between multiple nodes in the target. I am not sure
> of the current state of that work, but it would benefit all LIO targets
> when complete.

Zhu Lingshan (cc'ed) worked on a prototype for tcmu PR support. IIUC,
whether DLM or the underlying Ceph cluster gets used for PR state
storage is still under consideration.

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd disk full (partition 100% used)

2017-10-11 Thread Webert de Souza Lima
That sounds like it. Thanks David.
I wonder if that behavior of ignoring the OSD full_ratio is intentional.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Oct 11, 2017 at 12:26 PM, David Turner 
wrote:

> The full ratio is based on the max bytes.  If you say that the cache
> should have a max bytes of 1TB and that the full ratio is .8, then it will
> aim to keep it at 800GB.  Without a max bytes value set, the ratios are a
> percentage of unlimited... aka no limit themselves.  The full_ratio should
> be respected, but this is the second report of a cache tier reaching 100%
> this month so I'm guessing that the caching mechanisms might ignore those
> OSD settings in preference of the cache tier settings that were set
> improperly.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd disk full (partition 100% used)

2017-10-11 Thread David Turner
The full ratio is based on the max bytes.  If you say that the cache should
have a max bytes of 1TB and that the full ratio is .8, then it will aim to
keep it at 800GB.  Without a max bytes value set, the ratios are a
percentage of unlimited... aka no limit themselves.  The full_ratio should
be respected, but this is the second report of a cache tier reaching 100%
this month so I'm guessing that the caching mechanisms might ignore those
OSD settings in preference of the cache tier settings that were set
improperly.
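
In other words, the cache pool needs something along these lines set on it (the max_bytes value below is the one from your pool dump; the other values are only examples):

~# ceph osd pool set cephfs_cache target_max_bytes 343597383680
~# ceph osd pool set cephfs_cache target_max_objects 1000000
~# ceph osd pool set cephfs_cache cache_target_full_ratio 0.8
~# ceph osd pool set cephfs_cache cache_target_dirty_ratio 0.4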

On Wed, Oct 11, 2017 at 11:16 AM Webert de Souza Lima 
wrote:

> Hi,
>
> I have a cephfs cluster as follows:
>
> 1 15x HDD data pool (primary cephfs data pool)
> 1 2x SSD data pool (linked to a specific dir via xattrs)
> 1 2x SSD metadata pool
> 1 2x SSD cache tier pool
>
> the cache tier pool consists of 2 hosts, with one SSD OSD on each host,
> with size=2 replicated by host.
> Last night the disks went 100% full and the cluster went down.
>
> I know I made a mistake and set target_max_objects and target_max_bytes to
> 0 in the cache pool,
> but isn't ceph supposed to stop writing to an OSD when it reaches
> its full_ratio (default 0.95)?
> And what about the cache_target_full_ratio in the cache tier pool?
>
> Here is the cluster:
>
> ~# ceph fs ls
> name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data
> cephfs_data_ssd ]
>
> * the Metadata and the SSD data pools use the same 2 OSDs (one cephfs
> directory is linked to the SSD data pool via xattrs)
>
> ~# ceph -v
> ceph version 10.2.9-4-gbeaec39 (beaec397f00491079cd74f7b9e3e10660859e26b)
>
> ~# ceph osd pool ls detail
> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 136 lfor 115
> flags hashpspool crash_replay_interval 45 tiers 3 read_tier 3 write_tier 3
> stripe_width 0
> pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 2
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 617 flags
> hashpspool stripe_width 0
> pool 3 'cephfs_cache' replicated size 2 min_size 1 crush_ruleset 1
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 1493 lfor 115 flags
> hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes
> 343597383680 hit_set bloom{false_positive_probability: 0.05, target_size:
> 0, seed: 0} 0s x0 decay_rate 0 search_last_n 0 stripe_width 0
> pool 12 'cephfs_data_ssd' replicated size 2 min_size 1 crush_ruleset 2
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 653 flags
> hashpspool stripe_width 0
>
> ~# ceph osd tree
> ID  WEIGHT   TYPE NAME  UP/DOWN REWEIGHT
> PRIMARY-AFFINITY
>  -8  0.17598 root default-ssd
>
>  -9  0.09299 host bhs1-mail03-fe01-data
>
>  17  0.09299 osd.17  up  1.0
>  1.0
> -10  0.08299 host bhs1-mail03-fe02-data
>
>  18  0.08299 osd.18  up  1.0
>  1.0
>  -7  0.86319 root cache-ssd
>
>  -5  0.43159 host bhs1-mail03-fe01
>
>  15  0.43159 osd.15  up  1.0
>  1.0
>  -6  0.43159 host bhs1-mail03-fe02
>
>  16  0.43159 osd.16  up  1.0
>  1.0
>  -1 79.95895 root default
>
>  -2 26.65298 host bhs1-mail03-ds01
>
>   0  5.33060 osd.0   up  1.0
>  1.0
>   1  5.33060 osd.1   up  1.0
>  1.0
>   2  5.33060 osd.2   up  1.0
>  1.0
>   3  5.33060 osd.3   up  1.0
>  1.0
>   4  5.33060 osd.4   up  1.0
>  1.0
>  -3 26.65298 host bhs1-mail03-ds02
>
>   5  5.33060 osd.5   up  1.0
>  1.0
>   6  5.33060 osd.6   up  1.0
>  1.0
>   7  5.33060 osd.7   up  1.0
>  1.0
>   8  5.33060 osd.8   up  1.0
>  1.0
>   9  5.33060 osd.9   up  1.0
>  1.0
>  -4 26.65298 host bhs1-mail03-ds03
>
>  10  5.33060 osd.10  up  1.0
>  1.0
>  12  5.33060 osd.12  up  1.0
>  1.0
>  13  5.33060 osd.13  up  1.0
>  1.0
>  14  5.33060 osd.14  up  1.0
>  1.0
>  19  5.33060 osd.19  up  1.0
>  1.0
>
> ~# ceph osd crush rule dump
> [
> {
> "rule_id": 0,
> "rule_name": "replicated_ruleset",
> "ruleset": 0,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> 

[ceph-users] ceph osd disk full (partition 100% used)

2017-10-11 Thread Webert de Souza Lima
Hi,

I have a cephfs cluster as follows:

1 15x HDD data pool (primary cephfs data pool)
1 2x SSD data pool (linked to a specific dir via xattrs)
1 2x SSD metadata pool
1 2x SSD cache tier pool

the cache tier pool consists of 2 hosts, with one SSD OSD on each host, with
size=2 replicated by host.
Last night the disks went 100% full and the cluster went down.

I know I made a mistake and set target_max_objects and target_max_bytes to
0 in the cache pool,
but isn't ceph supposed to stop writing to an OSD when it reaches
its full_ratio (default 0.95)?
And what about the cache_target_full_ratio in the cache tier pool?

Here is the cluster:

~# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data
cephfs_data_ssd ]

* the Metadata and the SSD data pools use the same 2 OSDs (one cephfs
directory is linked to the SSD data pool via xattrs)

~# ceph -v
ceph version 10.2.9-4-gbeaec39 (beaec397f00491079cd74f7b9e3e10660859e26b)

~# ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 136 lfor 115
flags hashpspool crash_replay_interval 45 tiers 3 read_tier 3 write_tier 3
stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 2
object_hash rjenkins pg_num 128 pgp_num 128 last_change 617 flags
hashpspool stripe_width 0
pool 3 'cephfs_cache' replicated size 2 min_size 1 crush_ruleset 1
object_hash rjenkins pg_num 128 pgp_num 128 last_change 1493 lfor 115 flags
hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes
343597383680 hit_set bloom{false_positive_probability: 0.05, target_size:
0, seed: 0} 0s x0 decay_rate 0 search_last_n 0 stripe_width 0
pool 12 'cephfs_data_ssd' replicated size 2 min_size 1 crush_ruleset 2
object_hash rjenkins pg_num 128 pgp_num 128 last_change 653 flags
hashpspool stripe_width 0

~# ceph osd tree
ID  WEIGHT   TYPE NAME  UP/DOWN REWEIGHT
PRIMARY-AFFINITY
 -8  0.17598 root default-ssd

 -9  0.09299 host bhs1-mail03-fe01-data

 17  0.09299 osd.17  up  1.0
 1.0
-10  0.08299 host bhs1-mail03-fe02-data

 18  0.08299 osd.18  up  1.0
 1.0
 -7  0.86319 root cache-ssd

 -5  0.43159 host bhs1-mail03-fe01

 15  0.43159 osd.15  up  1.0
 1.0
 -6  0.43159 host bhs1-mail03-fe02

 16  0.43159 osd.16  up  1.0
 1.0
 -1 79.95895 root default

 -2 26.65298 host bhs1-mail03-ds01

  0  5.33060 osd.0   up  1.0
 1.0
  1  5.33060 osd.1   up  1.0
 1.0
  2  5.33060 osd.2   up  1.0
 1.0
  3  5.33060 osd.3   up  1.0
 1.0
  4  5.33060 osd.4   up  1.0
 1.0
 -3 26.65298 host bhs1-mail03-ds02

  5  5.33060 osd.5   up  1.0
 1.0
  6  5.33060 osd.6   up  1.0
 1.0
  7  5.33060 osd.7   up  1.0
 1.0
  8  5.33060 osd.8   up  1.0
 1.0
  9  5.33060 osd.9   up  1.0
 1.0
 -4 26.65298 host bhs1-mail03-ds03

 10  5.33060 osd.10  up  1.0
 1.0
 12  5.33060 osd.12  up  1.0
 1.0
 13  5.33060 osd.13  up  1.0
 1.0
 14  5.33060 osd.14  up  1.0
 1.0
 19  5.33060 osd.19  up  1.0
 1.0

~# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_ruleset",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "replicated_ruleset_ssd",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -7,
"item_name": "cache-ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "replicated-data-ssd",
"ruleset": 2,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -8,
  

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Jake Young
On Wed, Oct 11, 2017 at 8:57 AM Jason Dillaman  wrote:

> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López 
> wrote:
>
>> As far as I am able to understand there are 2 ways of setting iscsi for
>> ceph
>>
>> 1- using kernel (lrbd), only available on SUSE, CentOS, Fedora...
>>
>
> The target_core_rbd approach is only utilized by SUSE (and its derivatives
> like PetaSAN) as far as I know. This was the initial approach for Red
> Hat-derived kernels as well until the upstream kernel maintainers indicated
> that they really do not want a specialized target backend for just krbd.
> The next attempt was to re-use the existing target_core_iblock to interface
> with krbd via the kernel's block layer, but that hit similar upstream walls
> trying to get support for SCSI command passthrough to the block layer.
>
>
>> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
>>
>
> The TCMU approach is what upstream and Red Hat-derived kernels will
> support going forward.
>
> The lrbd project was developed by SUSE to assist with configuring a
> cluster of iSCSI gateways via the cli.  The ceph-iscsi-config +
> ceph-iscsi-cli projects are similar in goal but take a slightly different
> approach. ceph-iscsi-config provides a set of common Python libraries that
> can be re-used by ceph-iscsi-cli and ceph-ansible for deploying and
> configuring the gateway. The ceph-iscsi-cli project provides the gwcli tool
> which acts as a cluster-aware replacement for targetcli.
>
> I don't know which one is better, I am seeing that official support is
>> pointing to tcmu but I haven't done any benchmarking.
>>
>
> We (upstream Ceph) provide documentation for the TCMU approach because
> that is what is available against generic upstream kernels (starting with
> 4.14 when it's out). Since it uses librbd (which still needs to undergo
> some performance improvements) instead of krbd, we know that librbd 4k IO
> performance is slower compared to krbd, but 64k and 128k IO performance is
> comparable. However, I think most iSCSI tuning guides would already tell
> you to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks).
>
>
>> Has anyone tried both? Do they give the same output? Are both able to
>> manage multiple iscsi targets mapped to a single rbd disk?
>>
>
> Assuming you mean multiple portals mapped to the same RBD disk, the answer
> is yes, both approaches should support ALUA. The ceph-iscsi-config tooling
> will only configure Active/Passive because we believe there are certain
> edge conditions that could result in data corruption if configured for
> Active/Active ALUA.
>
> The TCMU approach also does not currently support SCSI persistent
> reservation groups (needed for Windows clustering) because that support
> isn't available in the upstream kernel. The SUSE kernel has an approach
> that utilizes two round-trips to the OSDs for each IO to simulate PGR
> support. Earlier this summer I believe SUSE started to look into how to get
> generic PGR support merged into the upstream kernel using corosync/dlm to
> synchronize the states between multiple nodes in the target. I am not sure
> of the current state of that work, but it would benefit all LIO targets
> when complete.
>
>
>> I will try to make my own testing but if anyone has tried in advance it
>> would be really helpful.
>>
>> --
>> *Jorge Pinilla López*
>> jorp...@unizar.es
>> --
>>
>>
>> 
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Thanks Jason!

You should cut and paste that answer into a blog post on ceph.com. It is a
great summary of where things stand.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] min_size & hybrid OSD latency

2017-10-11 Thread David Turner
Christian is correct that min_size does not affect how many need to ACK the
write, it is responsible for how many copies need to be available for the
PG to be accessible.  This is where SSD journals for filestore and SSD
DB/WAL partitions come into play.  The write is considered ACK'd as soon as
the journal has received the write.

Additionally please keep in mind that a write to an SSD across a network is
not going to be as fast as the SSD statistics claim on their own.  You are
adding in network latency to the device.

On Tue, Oct 10, 2017 at 7:51 PM Christian Balzer  wrote:

>
> Hello,
>
> On Wed, 11 Oct 2017 00:05:26 +0200 Jack wrote:
>
> > Hi,
> >
> > I would like some information about the following
> >
> > Let say I have a running cluster, with 4 OSDs: 2 SSDs, and 2 HDDs
> > My single pool has size=3, min_size=2
> >
> > For a write-only pattern, I thought I would get SSDs performance level,
> > because the write would be acked as soon as min_size OSDs acked
> >
> > But am I right?
> >
> You're the 2nd person in very recent times to come up with that wrong
> conclusion about min_size.
>
> All writes have to be ACKed, the only time where hybrid stuff helps is to
> accelerate reads.
> Which is something that people like me at least have very little interest
> in as the writes need to be fast.
>
> Christian
>
> > (the same setup could involve some high latency OSDs, in the case of
> > country-level cluster)
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] min_size & hybrid OSD latency

2017-10-11 Thread Reed Dier
Just for the sake of putting this in the public forum,

In theory, by placing the primary copy of the object on an SSD medium, and 
placing replica copies on HDD medium, it should still yield some improvement in 
writes, compared to an all HDD scenario.

My logic here is rooted in the idea that the first copy requires a write, ACK, 
and then a read to send a copy to the replicas.
So instead of a slow write, and a slow read on your first hop, you have a fast 
write and fast read on the first hop, before pushing out to the slower second 
hop of 2x slow writes and ACKs.
Doubly so, if you have active io on the cluster, the SSD is taking all of the 
read io away from the slow HDDs, freeing up iops on the HDDs, which in turn 
should clear write ops quicker.
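
(For what it's worth, that layout can be approximated with primary affinity rather than a custom CRUSH rule; a sketch, assuming osd.0/osd.1 are the SSDs and osd.2/osd.3 the HDDs in Jack's example, and that mon_osd_allow_primary_affinity is enabled on pre-Luminous releases:)

~# ceph osd primary-affinity osd.0 1.0
~# ceph osd primary-affinity osd.1 1.0
~# ceph osd primary-affinity osd.2 0
~# ceph osd primary-affinity osd.3 0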

Please poke holes in this if you can.

Hopefully this will be useful for someone searching the ML.

Thanks,

Reed


> On Oct 10, 2017, at 6:50 PM, Christian Balzer  wrote:
> 
> All writes have to be ACKed, the only time where hybrid stuff helps is to
> accelerate reads.
> Which is something that people like me at least have very little interest
> in as the writes need to be fast. 
> 
> Christian
> 
>> (the same setup could involve some high latency OSDs, in the case of
>> country-level cluster)
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] general protection fault: 0000 [#1] SMP

2017-10-11 Thread Olivier Bonvalet
Hi,

I had a "general protection fault: " with Ceph RBD kernel client.
Not sure how to read the call, is it Ceph related ?


Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault:  
[#1] SMP
Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid 
binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev 
br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph 
xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul 
ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel 
aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper 
intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt 
fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter 
ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables 
x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy 
async_pq async_xor xor async_tx raid6_pq
Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 
multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core 
xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp 
usb_common pps_core mdio
Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: 
kworker/1:0 Not tainted 4.13-dae-dom0 #2
Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel 
Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 
10/12/2016
Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr 
ceph_con_workfn [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.898355] task: 8801ce434280 
task.stack: c900151bc000
Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:c900151bfac0 
EFLAGS: 00010202
Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: 8801b63df000 RBX: 
88021b41be00 RCX: 04df
Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 04df RSI: 
4450736e24806564 RDI: 8801b63df000
Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ea0005fdd8c8 R08: 
88028545d618 R09: 0010
Oct 11 16:15:11 lorunde kernel: [311418.902104] R10:  R11: 
880215815000 R12: 
Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: 8802158156c0 R14: 
 R15: 8801ce434280
Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  () 
GS:88028544() knlGS:88028544
Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS:  ES:  
CR0: 80050033
Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 55a8461cfc20 CR3: 
01809000 CR4: 00042660
Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? 
__sk_mem_raise_allocated+0x4e/0x220
Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? 
__tcp_push_pending_frames+0x28/0x90
Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? do_tcp_sendpages+0x4fc/0x590
Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? ceph_tcp_sendpage+0x5c/0xc0 
[libceph]
Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 
[libceph]
Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? ceph_con_workfn+0xd08/0x22a0 
[libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? 
ceph_osdc_start_request+0x23/0x30 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? 
rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? process_one_work+0x1ad/0x340
Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? process_one_work+0x340/0x340
Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? 
kthread_create_on_node+0x40/0x40
Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
Oct 11 16:15:11 lorunde kernel: [311418.919826] Code: 43 4e 5b eb ec eb 1e 0f 
1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 
44 00 00 48 89 f8 48 89 d1  a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 
72 7e 40 38 
Oct 11 16:15:11 

Re: [ceph-users] advice on number of objects per OSD

2017-10-11 Thread David Turner
I've managed an RBD cluster that had all of the RBDs configured to 1M objects
and filled up the cluster to 75% full with 4TB drives.  Other than the
collection splitting (subfolder splitting as I've called it before) we
didn't have any problems with object counts.

On Wed, Oct 11, 2017 at 9:47 AM Gregory Farnum  wrote:

> These limits unfortunately aren’t very well understood or studied right
> now. The biggest slowdown I’m aware of is that when using FileStore you see
> an impact as it starts to create more folders internally (this is the
> “collection splitting”) and require more cached metadata to do fast lookups.
>
> But that doesn’t apply in the same ways to BlueStore, which shouldn’t have
> any of those cliff edges that I’m aware of. :)
> -Greg
> On Tue, Oct 10, 2017 at 3:45 AM Alexander Kushnirenko <
> kushnire...@gmail.com> wrote:
>
>> Hi,
>>
>> Are there any recommendations on what is the limit when osd performance
>> starts to decline because of a large number of objects? Or perhaps a procedure
>> on how to find this number (luminous)?  My understanding is that the
>> recommended object size is 10-100 MB, but is there any performance hit due
>> to large number of objects?  I ran across a number of about 1M objects, is
>> that so?  We do not have special SSD for journal and use librados for I/O.
>>
>> Alexander.
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] right way to recover a failed OSD (disk) when using BlueStore ?

2017-10-11 Thread Alejandro Comisario
David, thanks.
I've switched the branch to Luminous and the doc is the same (thankfully).

No worries, I'll wait until someone who has hopefully done it already can give
me a hint.
thanks!

On Wed, Oct 11, 2017 at 11:00 AM, David Turner 
wrote:

> Careful when you're looking at documentation.  You're looking at the
> master branch which might have unreleased features or changes that your
> release doesn't have.  You'll want to change master in the url to luminous
> to make sure that you're looking at the documentation for your version of
> Ceph.
>
> I haven't personally used bluestore yet so I can't say what the proper
> commands are there without just looking online for the answer.  I do know
> that there is no reason to have your DB and WAL devices on separate
> partitions if they're on the same device.  What's been mentioned on the ML
> is that you want to create a partition for the DB and the WAL will use it.
> A partition for the WAL is only if it is planned to be on a different
> device than the DB.
>
> On Tue, Oct 10, 2017 at 5:59 PM Alejandro Comisario 
> wrote:
>
>> Hi, I see some notes there that didn't exist on jewel:
>>
>> http://docs.ceph.com/docs/master/rados/operations/add-
>> or-rm-osds/#replacing-an-osd
>>
>> In my case what im using right now on that OSD is this :
>>
>> root@ndc-cl-osd4:~# ls -lsah /var/lib/ceph/osd/ceph-104
>> total 64K
>>0 drwxr-xr-x  2 ceph ceph  310 Sep 21 10:56 .
>> 4.0K drwxr-xr-x 25 ceph ceph 4.0K Sep 21 10:56 ..
>>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block ->
>> /dev/disk/by-partuuid/0ffa3ed7-169f-485c-9170-648ce656e9b1
>>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.db ->
>> /dev/disk/by-partuuid/5873e2cb-3c26-4a7d-8ff1-1bc3e2d62e5a
>>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.wal ->
>> /dev/disk/by-partuuid/aed9e5e4-c798-46b5-8243-e462e74f6485
>>
>> block.db and block.wal are on two different NVME partitions, which are
>> nvme1n1p17
>> and nvme1n1p18, so assuming that after hot swapping the device the drive
>> letter is "sdx", according to the link above what would be the right command
>> to re-use the two NVME partitions for block db and wal?
>>
>> I presume that everything else is the same.
>> best.
>>
>>
>> On Sat, Sep 30, 2017 at 9:00 PM, David Turner 
>> wrote:
>>
>>> I'm pretty sure that the process is the same as with filestore. The
>>> cluster doesn't really know if an osd is filestore or bluestore... It's
>>> just an osd running a daemon.
>>>
>>> If there are any differences, they would be in the release notes for
>>> Luminous as changes from Jewel.
>>>
>>> On Sat, Sep 30, 2017, 6:28 PM Alejandro Comisario 
>>> wrote:
>>>
 Hi all.
 Independently of the fact that I've deployed a Ceph Luminous cluster with Bluestore
 using ceph-ansible (https://github.com/ceph/ceph-ansible), what is the
 right way to replace a disk when using Bluestore?

 I will try to forget everything i know on how to recover things with
 filestore and start fresh.

 Any how-to's? Experiences? I don't seem to find an official way of
 doing it.
 best.

 --
 *Alejandro Comisario*
 *CTO | NUBELIU*
 E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
 <+54%209%2011%203770-1857>
 _
 www.nubeliu.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>> _
>>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] right way to recover a failed OSD (disk) when using BlueStore ?

2017-10-11 Thread David Turner
Careful when you're looking at documentation.  You're looking at the master
branch which might have unreleased features or changes that your release
doesn't have.  You'll want to change master in the url to luminous to make
sure that you're looking at the documentation for your version of Ceph.

I haven't personally used bluestore yet so I can't say what the proper
commands are there without just looking online for the answer.  I do know
that there is no reason to have your DB and WAL devices on separate
partitions if they're on the same device.  What's been mentioned on the ML
is that you want to create a partition for the DB and the WAL will use it.
A partition for the WAL is only if it is planned to be on a different
device than the DB.
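
Not having done it myself either, the Luminous replace-an-OSD flow from that doc would look roughly like the following when re-using the existing NVMe partitions; treat it as a sketch rather than a tested recipe, and note that --osd-id support depends on the exact 12.2.x ceph-disk release:

~# ceph osd destroy 104 --yes-i-really-mean-it      # keep the OSD id for re-use
~# ceph-disk zap /dev/sdx                           # the replacement HDD
~# ceph-disk prepare --bluestore /dev/sdx \
      --block.db /dev/nvme1n1p17 --block.wal /dev/nvme1n1p18 --osd-id 104
~# ceph-disk activate /dev/sdx1

Per the note above, the separate --block.wal only buys anything if the WAL lives on a different (faster) device than the DB; otherwise a single DB partition is enough.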

On Tue, Oct 10, 2017 at 5:59 PM Alejandro Comisario 
wrote:

> Hi, I see some notes there that didn't exist on jewel:
>
>
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>
> In my case what im using right now on that OSD is this :
>
> root@ndc-cl-osd4:~# ls -lsah /var/lib/ceph/osd/ceph-104
> total 64K
>0 drwxr-xr-x  2 ceph ceph  310 Sep 21 10:56 .
> 4.0K drwxr-xr-x 25 ceph ceph 4.0K Sep 21 10:56 ..
>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block ->
> /dev/disk/by-partuuid/0ffa3ed7-169f-485c-9170-648ce656e9b1
>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.db ->
> /dev/disk/by-partuuid/5873e2cb-3c26-4a7d-8ff1-1bc3e2d62e5a
>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.wal ->
> /dev/disk/by-partuuid/aed9e5e4-c798-46b5-8243-e462e74f6485
>
> block.db and block.wal are on two different NVME partitions, which are
> nvme1n1p17
> and nvme1n1p18, so assuming that after hot swapping the device the drive letter
> is "sdx" according to the link above what would be the right command to
> re-use the two NVME partitions for block db and wal ?
>
> I presume that everything else is the same.
> best.
>
>
> On Sat, Sep 30, 2017 at 9:00 PM, David Turner 
> wrote:
>
>> I'm pretty sure that the process is the same as with filestore. The
>> cluster doesn't really know if an osd is filestore or bluestore... It's
>> just an osd running a daemon.
>>
>> If there are any differences, they would be in the release notes for
>> Luminous as changes from Jewel.
>>
>> On Sat, Sep 30, 2017, 6:28 PM Alejandro Comisario 
>> wrote:
>>
>>> Hi all.
>>> Independetly that i've deployerd a ceph Luminous cluster with Bluestore
>>> using ceph-ansible (https://github.com/ceph/ceph-ansible) what is the
>>> right way to replace a disk when using Bluestore ?
>>>
>>> I will try to forget everything i know on how to recover things with
>>> filestore and start fresh.
>>>
>>> Any how-to's? Experiences? I don't seem to find an official way of
>>> doing it.
>>> best.
>>>
>>> --
>>> *Alejandro Comisario*
>>> *CTO | NUBELIU*
>>> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
>>> <+54%209%2011%203770-1857>
>>> _
>>> www.nubeliu.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
> --
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
> _
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] advice on number of objects per OSD

2017-10-11 Thread Gregory Farnum
These limits unfortunately aren’t very well understood or studied right
now. The biggest slowdown I’m aware of is that when using FileStore you see
an impact as it starts to create more folders internally (this is the
“collection splitting”) and requires more cached metadata to do fast lookups.

But that doesn’t apply in the same ways to BlueStore, which shouldn’t have
any of those cliff edges that I’m aware of. :)
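
For anyone still on FileStore and hitting that splitting behaviour, the knobs involved are the split/merge thresholds; the values below are common community tunings rather than official recommendations:

[osd]
filestore merge threshold = 40
filestore split multiple = 8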
-Greg
On Tue, Oct 10, 2017 at 3:45 AM Alexander Kushnirenko 
wrote:

> Hi,
>
> Are there any recommendations on what is the limit when osd performance
> starts to decline because of a large number of objects? Or perhaps a procedure
> on how to find this number (luminous)?  My understanding is that the
> recommended object size is 10-100 MB, but is there any performance hit due
> to large number of objects?  I ran across a number of about 1M objects, is
> that so?  We do not have special SSD for journal and use librados for I/O.
>
> Alexander.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore Cache Ratios

2017-10-11 Thread Mark Nelson

Hi Jorge,

I was sort of responsible for all of this. :)

So basically there are different caches in different places:

- rocksdb bloom filter and index cache
- rocksdb block cache (which can be configured to include filters and 
indexes)

- rocksdb compressed block cache
- bluestore onode cache

The bluestore onode cache is the only one that stores onode/extent/blob 
metadata before it is encoded, i.e. it's bigger but has a lower impact on 
the CPU.  The next step is the regular rocksdb block cache where we've 
already encoded the data, but it's not compressed.  Optionally we could 
also compress the data and then cache it using rocksdb's compressed 
block cache.  Finally, rocksdb can set memory aside for bloom filters 
and indexes but we're configuring those to go into the block cache so we 
can get a better accounting for how memory is being used (otherwise it's 
difficult to control how much memory index and filters get).  The 
downside is that bloom filters and indexes can theoretically get paged 
out under heavy cache pressure.  We set these to be high priority in the 
block cache and also pin the L0 filters/index though to help avoid this.


In the testing I did earlier this year, what I saw is that in low memory 
scenarios it's almost always best to give all of the cache to rocksdb's 
block cache.  Once you hit about the 512MB mark, we start seeing bigger 
gains by giving additional memory to bluestore's onode cache.  So we 
devised a mechanism where you can decide where to cut over.  It's quite 
possible that on very fast CPUs it might make sense to use rocksdb 
compressed cache, or possibly if you have a huge number of objects these 
ratios might change.  The values we have now were sort of the best 
jack-of-all-trades values we found.
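
To make the knobs concrete, the defaults discussed in this thread map to roughly the following in ceph.conf (a sketch of the Luminous defaults, not a tuning recommendation):

[osd]
bluestore cache size ssd = 3221225472    # 3GB total cache on SSD-backed OSDs
bluestore cache size hdd = 1073741824    # 1GB total cache on HDD-backed OSDs
bluestore cache kv ratio = 0.99
bluestore cache meta ratio = 0.01
bluestore cache kv max = 536870912       # 512MB cap; the remainder spills to metadata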


Mark

On 10/11/2017 08:32 AM, Jorge Pinilla López wrote:

Okay, thanks for the explanation, so from the 3GB of cache (the default
cache for SSD) only 0.5GB is going to K/V and 2.5GB is going to metadata.

Is there a way of knowing how much k/v, metadata, and data is being stored and
how full the cache is, so I can adjust my ratios? I was thinking of some ratios
(like 0.9 k/v, 0.07 meta, 0.03 data) but I'm only speculating; I don't have
any real data.

On 11/10/2017 at 14:32, Mohamad Gebai wrote:

Hi Jorge,

On 10/10/2017 07:23 AM, Jorge Pinilla López wrote:

Are .99 KV, .01 MetaData and .0 Data ratios right? They seem a little
too disproportionate.

Yes, this is correct.


Also, .99 KV and a cache of 3GB for SSD means that almost all of the 3GB would
be used for KV, but there is also another attribute called
bluestore_cache_kv_max which is by default 512MB, so what is the rest
of the cache used for? Nothing? Shouldn't the kv_max value be higher or
the KV ratio lower?

Anything over the *cache_kv_max value goes to the metadata cache. You
can look in your logs to see the final values of kv, metadata and data
cache ratios. To get data cache, you need to lower the ratios of
metadata and kv caches.

Mohamad


--

*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Intern at the Systems Area (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore Cache Ratios

2017-10-11 Thread Jorge Pinilla López
Okay, thanks for the explanation, so from the 3GB of cache (the default
cache for SSD) only 0.5GB is going to K/V and 2.5GB is going to metadata.

Is there a way of knowing how much k/v, metadata and data is being stored and
how full the cache is, so I can adjust my ratios? I was thinking of ratios
like 0.9 k/v, 0.07 meta and 0.03 data, but I am only speculating, I don't have
any real data.
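
(One rough way to see where the memory is actually going on Luminous is the OSD admin socket; the exact field names vary by version, so treat this only as a pointer:)

ceph daemon osd.0 dump_mempools
ceph daemon osd.0 perf dump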

On 11/10/2017 at 14:32, Mohamad Gebai wrote:
> Hi Jorge,
>
> On 10/10/2017 07:23 AM, Jorge Pinilla López wrote:
>> Are .99 KV, .01 MetaData and .0 Data ratios right? They seem a little
>> too disproportionate.
> Yes, this is correct.
>
>> Also, .99 KV and a cache of 3GB for SSD means that almost all of the 3GB would
>> be used for KV, but there is also another attribute called
>> bluestore_cache_kv_max which is by default 512MB, so what is the rest
>> of the cache used for? Nothing? Shouldn't the kv_max value be higher or
>> the KV ratio lower?
> Anything over the *cache_kv_max value goes to the metadata cache. You
> can look in your logs to see the final values of kv, metadata and data
> cache ratios. To get data cache, you need to lower the ratios of
> metadata and kv caches.
>
> Mohamad

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Intern at the Systems Area (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Jason Dillaman
On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López 
wrote:

> As far as I am able to understand there are 2 ways of setting iscsi for
> ceph
>
> 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
>

The target_core_rbd approach is only utilized by SUSE (and its derivatives
like PetaSAN) as far as I know. This was the initial approach for Red
Hat-derived kernels as well until the upstream kernel maintainers indicated
that they really do not want a specialized target backend for just krbd.
The next attempt was to re-use the existing target_core_iblock to interface
with krbd via the kernel's block layer, but that hit similar upstream walls
trying to get support for SCSI command passthrough to the block layer.


> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
>

The TCMU approach is what upstream and Red Hat-derived kernels will support
going forward.

The lrbd project was developed by SUSE to assist with configuring a cluster
of iSCSI gateways via the cli.  The ceph-iscsi-config + ceph-iscsi-cli
projects are similar in goal but take a slightly different approach.
ceph-iscsi-config provides a set of common Python libraries that can be
re-used by ceph-iscsi-cli and ceph-ansible for deploying and configuring
the gateway. The ceph-iscsi-cli project provides the gwcli tool which acts
as a cluster-aware replacement for targetcli.

> I don't know which one is better, I am seeing that official support is
> pointing to tcmu but I haven't done any benchmarking.
>

We (upstream Ceph) provide documentation for the TCMU approach because that
is what is available against generic upstream kernels (starting with 4.14
when it's out). Since it uses librbd (which still needs to undergo some
performance improvements) instead of krbd, we know that librbd 4k IO
performance is slower compared to krbd, but 64k and 128k IO performance is
comparable. However, I think most iSCSI tuning guides would already tell
you to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks).
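
(As an illustration of that initiator-side tuning, on a Windows client the larger allocation unit can be chosen at format time; the drive letter here is purely an example:)

format E: /FS:NTFS /A:64K /Q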


> Has anyone tried both? Do they give the same output? Are both able to
> manage multiple iscsi targets mapped to a single rbd disk?
>

Assuming you mean multiple portals mapped to the same RBD disk, the answer
is yes, both approaches should support ALUA. The ceph-iscsi-config tooling
will only configure Active/Passive because we believe there are certain
edge conditions that could result in data corruption if configured for
Active/Active ALUA.

The TCMU approach also does not currently support SCSI persistent
reservation groups (needed for Windows clustering) because that support
isn't available in the upstream kernel. The SUSE kernel has an approach
that utilizes two round-trips to the OSDs for each IO to simulate PGR
support. Earlier this summer I believe SUSE started to look into how to get
generic PGR support merged into the upstream kernel using corosync/dlm to
synchronize the states between multiple nodes in the target. I am not sure
of the current state of that work, but it would benefit all LIO targets
when complete.


> I will try to make my own testing but if anyone has tried in advance it
> would be really helpful.
>
> --
> *Jorge Pinilla López*
> jorp...@unizar.es
> --
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore Cache Ratios

2017-10-11 Thread Mohamad Gebai
Hi Jorge,

On 10/10/2017 07:23 AM, Jorge Pinilla López wrote:
> Are .99 KV, .01 MetaData and .0 Data ratios right? They seem a little
> too disproportionate.

Yes, this is correct.

> Also, .99 KV and a cache of 3GB for SSD means that almost all of the 3GB would
> be used for KV, but there is also another attribute called
> bluestore_cache_kv_max which is by default 512MB, so what is the rest
> of the cache used for? Nothing? Shouldn't the kv_max value be higher or
> the KV ratio lower?

Anything over the *cache_kv_max value goes to the metadata cache. You
can look in your logs to see the final values of kv, metadata and data
cache ratios. To get data cache, you need to lower the ratios of
metadata and kv caches.

Mohamad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-11 Thread Alexander Kushnirenko
Oh!  I posted a wrong link, sorry.  The picture which explains stripe_unit and
stripe count is here:

https://indico.cern.ch/event/330212/contributions/1718786/attachments/642384/883834/CephPluginForXroot.pdf

I tried to attach it in the mail, but it was blocked.


On Wed, Oct 11, 2017 at 3:16 PM, Alexander Kushnirenko <
kushnire...@gmail.com> wrote:

> Hi, Ian!
>
> Thank you for your reference!
>
> Could you comment on the following rule:
> object_size = stripe_unit * stripe_count
> Or is that not necessarily so?
>
> I refer to page 8 in this report:
>
> https://indico.cern.ch/event/531810/contributions/2298934/at
> tachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf
>
>
> Alexander.
>
> On Wed, Oct 11, 2017 at 1:11 PM,  wrote:
>
>> Hi Gregory
>>
>> You’re right, when setting the object layout in libradosstriper, one
>> should set all three parameters (the number of stripes, the size of the
>> stripe unit, and the size of the striped object). The Ceph plugin for
>> GridFTP has an example of this at https://github.com/stfc/gridFT
>> PCephPlugin/blob/master/ceph_posix.cpp#L371
>>
>>
>>
>> At RAL, we use the following values:
>>
>>
>>
>> $STRIPER_NUM_STRIPES 1
>>
>> $STRIPER_STRIPE_UNIT 8388608
>>
>> $STRIPER_OBJECT_SIZE 67108864
>>
>>
>>
>> Regards,
>>
>>
>>
>> Ian Johnson MBCS
>>
>> Data Services Group
>>
>> Scientific Computing Department
>>
>> Rutherford Appleton Laboratory
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-11 Thread Alexander Kushnirenko
Hi, Ian!

Thank you for your reference!

Could you comment on the following rule:
object_size = stripe_unit * stripe_count
Or is that not necessarily so?

I refer to page 8 in this report:

https://indico.cern.ch/event/531810/contributions/2298934/at
tachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf


Alexander.

On Wed, Oct 11, 2017 at 1:11 PM,  wrote:

> Hi Gregory
>
> You’re right, when setting the object layout in libradosstriper, one
> should set all three parameters (the number of stripes, the size of the
> stripe unit, and the size of the striped object). The Ceph plugin for
> GridFTP has an example of this at https://github.com/stfc/gridFT
> PCephPlugin/blob/master/ceph_posix.cpp#L371
>
>
>
> At RAL, we use the following values:
>
>
>
> $STRIPER_NUM_STRIPES 1
>
> $STRIPER_STRIPE_UNIT 8388608
>
> $STRIPER_OBJECT_SIZE 67108864
>
>
>
> Regards,
>
>
>
> Ian Johnson MBCS
>
> Data Services Group
>
> Scientific Computing Department
>
> Rutherford Appleton Laboratory
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-11 Thread Alexander Kushnirenko
Hi, Gregory!

You are absolutely right! Thanks!

The following sequence solves the problem:
rados_striper_set_object_layout_stripe_unit(m_striper, stripe_unit);
rados_striper_set_object_layout_stripe_count(m_striper, stripe_count);
int stripe_size = stripe_unit * stripe_count;
rados_striper_set_object_layout_object_size(m_striper, stripe_size);
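
For completeness, here is roughly how those calls fit into a standalone C program (the pool name, sizes and missing error handling are illustrative assumptions, not a recommendation; build with -lrados -lradosstriper):

#include <stdio.h>
#include <string.h>
#include <rados/librados.h>
#include <radosstriper/libradosstriper.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    rados_striper_t striper;
    unsigned int stripe_unit  = 8 * 1024 * 1024;   /* 8M stripe unit */
    unsigned int stripe_count = 4;
    char buf[4096];

    memset(buf, 'x', sizeof(buf));

    rados_create(&cluster, NULL);                  /* connect as client.admin */
    rados_conf_read_file(cluster, NULL);           /* default ceph.conf search path */
    rados_connect(cluster);
    rados_ioctx_create(cluster, "backup", &ioctx); /* pool name as used in this thread */

    rados_striper_create(ioctx, &striper);
    rados_striper_set_object_layout_stripe_unit(striper, stripe_unit);
    rados_striper_set_object_layout_stripe_count(striper, stripe_count);
    /* object_size must be >= stripe_unit; unit * count keeps the assert happy */
    rados_striper_set_object_layout_object_size(striper, stripe_unit * stripe_count);

    rados_striper_write(striper, "test-object", buf, sizeof(buf), 0);

    rados_striper_destroy(striper);
    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return 0;
}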

Now there is very little in the documentation about the meaning of the above
parameters.  The only document I found is a CERN IT group presentation (page
8).  Perhaps it is obvious.  Also, it seems that optimizing these parameters
is only meaningful in large-scale Ceph installations.

https://indico.cern.ch/event/531810/contributions/2298934/
attachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf

Now if I have 6TB disks, then in a default installation there would be
6TB / 4MB = 1.5M objects per OSD.  Does that create any performance hit?

Thank you,
Alexander

On Tue, Oct 10, 2017 at 12:38 AM, Gregory Farnum  wrote:

> Well, just from a quick skim, libradosstriper.h has a function
> rados_striper_set_object_layout_object_size(rados_striper_t striper,
> unsigned int object_size)
> and libradosstriper.hpp has one in RadosStriper
> set_object_layout_object_size(unsigned int object_size);
>
> So I imagine you specify it with those the same way you've set the stripe
> unit and counts.
>
> On Sat, Oct 7, 2017 at 12:38 PM Alexander Kushnirenko <
> kushnire...@gmail.com> wrote:
>
>> Hi, Gregory!
>>
>> It turns out that this error is an internal Ceph feature. I wrote a standalone
>> program to create a 132M object in striper mode. It works only for a 4M
>> stripe.  If you set stripe_unit = 2M it still creates a 4M stripe_unit.
>> Anything bigger than 4M causes a crash here
>> :
>>
>>
>> __u32 object_size = layout->object_size;
>>   __u32 su = layout->stripe_unit;
>>   __u32 stripe_count = layout->stripe_count;
>>   assert(object_size >= su);   <
>>
>> I'm curious where it gets layout->object_size for an object that has just
>> been created.
>>
>> As I understood it, striper mode was created by the CERN guys.  In their document
>> 
>> they recommend an 8M stripe_unit.  But it does not work in Luminous.
>>
>> Created I/O context.
>> Connected to pool backup with rados_striper_create
>> Stripe unit OK 8388608
>> Stripe count OK 1
>> /build/ceph-12.2.0/src/osdc/Striper.cc: In function 'static void
>> Striper::file_to_extents(CephContext*, const char*, const
>> file_layout_t*, uint64_t, uint64_t, uint64_t, std::map> std::vector >&, uint64_t)' thread 7f13bd5c1e00 time
>> 2017-10-07 21:44:58.654778
>> /build/ceph-12.2.0/src/osdc/Striper.cc: 64: FAILED assert(object_size >=
>> su)
>>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous
>> (rc)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x102) [0x7f13b3f3b332]
>>  2: (Striper::file_to_extents(CephContext*, char const*, file_layout_t
>> const*, unsigned long, unsigned long, unsigned long, std::map> std::vector,
>> std::less, std::allocator std::vector > > >&, unsigned
>> long)+0x1e1e) [0x7f13bce235ee]
>>  3: (Striper::file_to_extents(CephContext*, char const*, file_layout_t
>> const*, unsigned long, unsigned long, unsigned long,
>> std::vector&, unsigned
>> long)+0x51) [0x7f13bce23691]
>>  4: (libradosstriper::RadosStriperImpl::internal_
>> aio_write(std::__cxx11::basic_string> std::allocator > const&, 
>> boost::intrusive_ptr,
>> ceph::buffer::list const&, unsigned long, unsigned long, ceph_file_layout
>> const&)+0x224) [0x7f13bcda4184]
>>  5: (libradosstriper::RadosStriperImpl::write_in_
>> open_object(std::__cxx11::basic_string> std::allocator > const&, ceph_file_layout const&,
>> std::__cxx11::basic_string> std::allocator > const&, ceph::buffer::list const&, unsigned long,
>> unsigned long)+0x13c) [0x7f13bcda476c]
>>  6: 
>> (libradosstriper::RadosStriperImpl::write(std::__cxx11::basic_string> std::char_traits, std::allocator > const&, ceph::buffer::list
>> const&, unsigned long, unsigned long)+0xd5) [0x7f13bcda4bd5]
>>  7: (rados_striper_write()+0xdb) [0x7f13bcd9ba0b]
>>  8: (()+0x10fb) [0x55dd87b410fb]
>>  9: (__libc_start_main()+0xf1) [0x7f13bc9d72b1]
>>  10: (()+0xbca) [0x55dd87b40bca]
>>
>>
>> On Fri, Sep 29, 2017 at 11:46 PM, Gregory Farnum 
>> wrote:
>>
>>> I haven't used the striper, but it appears to make you specify sizes,
>>> stripe units, and stripe counts. I would expect you need to make sure that
>>> the size is an integer multiple of the stripe unit. And it probably
>>> defaults to a 4MB object if you don't specify one?
>>>
>>> On Fri, Sep 29, 2017 at 

Re: [ceph-users] assertion error trying to start mds server

2017-10-11 Thread John Spray
On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer  wrote:
> I've been in the process of updating my gentoo based cluster both with
> new hardware and a somewhat postponed update.  This includes some major
> stuff including the switch from gcc 4.x to 5.4.0 on existing hardware
> and using gcc 6.4.0 to make better use of AMD Ryzen on the new
> hardware.  The existing cluster was on 10.2.2, but I was going to
> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
> transitioning to bluestore on the osd's.
>
> The Ryzen units are slated to be bluestore based OSD servers if and when
> I get to that point.  Up until the mds failure, they were simply cephfs
> clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
> MON) and had two servers left to update.  Both of these are also MONs
> and were acting as a pair of dual active MDS servers running 10.2.2.
> Monday morning I found out the hard way that a UPS one of them was on
> has a dead battery.  After I fsck'd and came back up, I saw the
> following assertion error when it was trying to start its mds.B server:
>
>
>  mdsbeacon(64162/B up:replay seq 3 v4699) v7  126+0+0 (709014160
> 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
>  0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In
> function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700
> time 2017-10-09 11:43:06.934972
> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
>
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x82) [0x55f93d64a122]
>  2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
>  3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
>  4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
>  5: (()+0x74a4) [0x7f6fd009b4a4]
>  6: (clone()+0x6d) [0x7f6fce5a598d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 rbd_mirror
>0/ 5 rbd_replay
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>0/ 5 osd
>0/ 5 optracker
>0/ 5 objclass
>1/ 3 filestore
>1/ 3 journal
>0/ 5 ms
>1/ 5 mon
>0/10 monc
>1/ 5 paxos
>0/ 5 tp
>1/ 5 auth
>1/ 5 crypto
>1/ 1 finisher
>1/ 5 heartbeatmap
>1/ 5 perfcounter
>1/ 5 rgw
>1/10 civetweb
>1/ 5 javaclient
>1/ 5 asok
>1/ 1 throttle
>0/ 0 refs
>1/ 5 xio
>1/ 5 compressor
>1/ 5 newstore
>1/ 5 bluestore
>1/ 5 bluefs
>1/ 3 bdev
>1/ 5 kstore
>4/ 5 rocksdb
>4/ 5 leveldb
>1/ 5 kinetic
>1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 1
>   max_new 1000
>   log_file /var/log/ceph/ceph-mds.B.log
>
>
>
> When I was googling around, I ran into this CERN presentation and tried
> out the offline backward scrubbing commands on slide 25 first:
>
> https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf
>
>
> Both ran without any messages, so I'm assuming I have sane contents in
> the cephfs_data and cephfs_metadata pools.  Still no luck getting things
> restarted, so I tried the cephfs-journal-tool journal reset on slide
> 23.  That didn't work either.  Just for giggles, I tried setting up the
> two Ryzen boxes as new mds.C and mds.D servers which would run on
> 10.2.7-r1 instead of using mds.A and mds.B (10.2.2).  The D server fails
> with the same assert as follows:


Because this system was running multiple active MDSs on Jewel (based
on seeing an EImportStart journal entry), and that was known to be
unstable, I would advise you to blow away the filesystem and create a
fresh one using luminous (where multi-mds is stable), rather than
trying to debug it.  Going back to try and work out what went wrong
with Jewel code is probably not a very valuable activity unless you
have irreplaceable data.

If you do want to get this filesystem back on its feet in-place:
(first stopping all MDSs) I'm guessing that your cephfs-journal-tool
reset didn't help because you had multiple MDS ranks, and that tool
just operates on rank 0 by default.  You need to work out which rank's
journal is actually damaged (it's part of the prefix to MDS log
messages), and then pass a --rank argument to cephfs-journal-tool.
You will also need to reset all the other ranks' journals to keep
things consistent, and then do a "ceph fs reset" so that it will start
up with a single MDS next time.  If you get the filesystem up and
running again, I'd still recommend copying anything important off it
and 
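
For reference, the recovery sequence sketched above would look roughly like this for a two-rank filesystem named "cephfs" (the exact flag syntax differs a little between releases, so treat this as an outline rather than a recipe):

# with all MDS daemons stopped
cephfs-journal-tool --rank=0 journal reset
cephfs-journal-tool --rank=1 journal reset
ceph fs reset cephfs --yes-i-really-mean-it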

[ceph-users] Ceph-ISCSI

2017-10-11 Thread Jorge Pinilla López
As far as I am able to understand, there are 2 ways of setting up iSCSI for Ceph:

1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)

I don't know which one is better; I am seeing that official support is
pointing to tcmu but I haven't done any benchmarking.
Has anyone tried both? Do they give the same output? Are both able to
manage multiple iSCSI targets mapped to a single RBD disk?

I will try to do my own testing, but if anyone has tried it already it
would be really helpful.


*Jorge Pinilla López*
jorp...@unizar.es



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-11 Thread ian.johnson
Hi Gregory
You're right, when setting the object layout in libradosstriper, one should set 
all three parameters (the number of stripes, the size of the stripe unit, and 
the size of the striped object). The Ceph plugin for GridFTP has an example of 
this at 
https://github.com/stfc/gridFTPCephPlugin/blob/master/ceph_posix.cpp#L371

At RAL, we use the following values:

$STRIPER_NUM_STRIPES 1
$STRIPER_STRIPE_UNIT 8388608
$STRIPER_OBJECT_SIZE 67108864
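
(Worth noting: with those values each striped object holds 67108864 / 8388608 = 8 stripe units, so object_size does not have to equal stripe_unit * stripe_count - the assert quoted elsewhere in this thread only requires object_size >= stripe_unit, and, as far as I can tell, a multiple of it.)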

Regards,

Ian Johnson MBCS
Data Services Group
Scientific Computing Department
Rutherford Appleton Laboratory

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A new SSD for journals - everything sucks?

2017-10-11 Thread Piotr Dałek

On 17-10-11 09:50 AM, Josef Zelenka wrote:

Hello everyone,
lately, we've had issues with buying the SSDs that we use for
journaling (Kingston stopped making them) - the Kingston V300 - so we decided to
start using a different model and started researching which one would be the
best price/value for us. We compared five models to check if they are
compatible with our needs - SSDNow V300, HyperX Fury, SSDNow KC400, SSDNow
UV400 and SSDNow A400. The best one is still the V300, with the highest IOPS
of 59,001. Second best and still usable was the HyperX Fury with 45,000
IOPS.  The other three had terrible results; the max IOPS we got were around
13,000 with the dsync and direct flags. We also tested Samsung SSDs (the EVO
series) and we got similarly bad results. To get to the root of my question
- I am pretty sure we are not the only ones affected by the V300's death. Is
there anyone else out there with some benchmarking data/knowledge about some
good price/performance SSDs for Ceph journaling? I can also share the
complete benchmarking data my coworker made, if someone is interested.


Never, absolutely never pick consumer-grade SSDs for a Ceph cluster, and in 
particular - never pick a drive with a low TBW rating for the journal. Ceph is going to 
kill it within a few months. Besides, consumer-grade drives are not 
optimized for Ceph-like/enterprise workloads, resulting in weird performance 
characteristics, like tens of thousands of IOPS for a first few seconds, 
then dropping to 1K IOPS (typical for drives with TLC NAND and SLC NAND 
cache), or performing reasonably till some write queue depth is hit, then 
degrading badly (underperforming controller), or killing your OSD journals 
on power failure (no BBU or capacitors to power the drive while flushing 
when PSU goes down).


You may want to look at this:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
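
(The test described there boils down to a single-threaded 4k write with the direct and dsync flags, roughly as below; the device name is an example and the test is destructive to whatever is on it:)

dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test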

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] A new SSD for journals - everything sucks?

2017-10-11 Thread Josef Zelenka

Hello everyone,
lately, we've had issues with buying the SSDs that we use for 
journaling (Kingston stopped making them) - the Kingston V300 - so we decided 
to start using a different model and started researching which one would 
be the best price/value for us. We compared five models to check if 
they are compatible with our needs - SSDNow V300, HyperX Fury, SSDNow 
KC400, SSDNow UV400 and SSDNow A400. The best one is still the V300, 
with the highest IOPS of 59,001. Second best and still usable was the 
HyperX Fury with 45,000 IOPS.  The other three had terrible results; the 
max IOPS we got were around 13,000 with the dsync and direct flags. We 
also tested Samsung SSDs (the EVO series) and we got similarly bad 
results. To get to the root of my question - I am pretty sure we are not 
the only ones affected by the V300's death. Is there anyone else out 
there with some benchmarking data/knowledge about some good 
price/performance SSDs for Ceph journaling? I can also share the 
complete benchmarking data my coworker made, if someone is interested.

Thanks
Josef Zelenka
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?

2017-10-11 Thread Konrad Riedel

Thanks a lot - problem fixed.


On 10.10.2017 16:58, Peter Linder wrote:

I think your failure domain within your rules is wrong.

step choose firstn 0 type osd

Should be:

step choose firstn 0 type host


On 10/10/2017 5:05 PM, Konrad Riedel wrote:

Hello Ceph-users,

after switching to Luminous I was excited about the great
crush-device-class feature - now we have 5 servers with 1x2TB NVMe
based OSDs, 3 of them additionally with 4 HDDs per server. (We have
only three 400G NVMe disks for block.wal and block.db and therefore
can't distribute all HDDs evenly across all servers.)

Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on
the same
Host:

ceph pg map 5.b
osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]

(on rebooting this host I had 4 stale PGs)

I've written a small Perl script to append the hostname to each OSD number, and
found many PGs where Ceph placed 2 replicas on the same host:

5.1e7: 8 - daniel 9 - daniel 11 - udo
5.1eb: 10 - udo 7 - daniel 9 - daniel
5.1ec: 10 - udo 11 - udo 7 - daniel
5.1ed: 13 - felix 16 - felix 5 - udo


Is there any way I can correct this?


Please see crushmap below. Thanks for any help!

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 device1
device 2 osd.2 class ssd
device 3 device3
device 4 device4
device 5 osd.5 class hdd
device 6 device6
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 device15
device 16 osd.16 class hdd
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 osd.24 class hdd
device 25 device25
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host daniel {
 id -4    # do not change unnecessarily
 id -2 class hdd    # do not change unnecessarily
 id -9 class ssd    # do not change unnecessarily
 # weight 3.459
 alg straw2
 hash 0    # rjenkins1
 item osd.31 weight 1.819
 item osd.7 weight 0.547
 item osd.8 weight 0.547
 item osd.9 weight 0.547
}
host felix {
 id -5    # do not change unnecessarily
 id -3 class hdd    # do not change unnecessarily
 id -10 class ssd    # do not change unnecessarily
 # weight 3.653
 alg straw2
 hash 0    # rjenkins1
 item osd.33 weight 1.819
 item osd.13 weight 0.547
 item osd.14 weight 0.467
 item osd.16 weight 0.547
 item osd.0 weight 0.274
}
host udo {
 id -6    # do not change unnecessarily
 id -7 class hdd    # do not change unnecessarily
 id -11 class ssd    # do not change unnecessarily
 # weight 4.006
 alg straw2
 hash 0    # rjenkins1
 item osd.32 weight 1.819
 item osd.5 weight 0.547
 item osd.10 weight 0.547
 item osd.11 weight 0.547
 item osd.12 weight 0.547
}
host moritz {
 id -13    # do not change unnecessarily
 id -14 class hdd    # do not change unnecessarily
 id -15 class ssd    # do not change unnecessarily
 # weight 1.819
 alg straw2
 hash 0    # rjenkins1
 item osd.30 weight 1.819
}
host bruno {
 id -16    # do not change unnecessarily
 id -17 class hdd    # do not change unnecessarily
 id -18 class ssd    # do not change unnecessarily
 # weight 3.183
 alg straw2
 hash 0    # rjenkins1
 item osd.24 weight 0.273
 item osd.26 weight 0.273
 item osd.27 weight 0.273
 item osd.28 weight 0.273
 item osd.29 weight 0.273
 item osd.2 weight 1.819
}
root default {
 id -1    # do not change unnecessarily
 id -8 class hdd    # do not change unnecessarily
 id -12 class ssd    # do not change unnecessarily
 # weight 16.121
 alg straw2
 hash 0    # rjenkins1
 item daniel weight 3.459
 item felix weight 3.653
 item udo weight 4.006
 item moritz weight 1.819
 item bruno weight 3.183
}

# rules
rule ssd {
 id 0
 type replicated
 min_size 1
 max_size 10
 step take default class ssd
 step choose firstn 0 type osd
 step emit
}
rule hdd {
 id 1
 type replicated
 min_size 1
 max_size 10
 step take default class hdd
 step choose firstn 0 type osd
 step emit
}

# end crush map
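
(For reference, rules along the lines Peter suggests would look roughly like this; note the use of chooseleaf rather than choose, which is the usual form for a replicated rule since it picks distinct hosts and then descends to one OSD inside each:)

rule ssd {
 id 0
 type replicated
 min_size 1
 max_size 10
 step take default class ssd
 step chooseleaf firstn 0 type host
 step emit
}
rule hdd {
 id 1
 type replicated
 min_size 1
 max_size 10
 step take default class hdd
 step chooseleaf firstn 0 type host
 step emit
}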



--

Kind regards

Konrad Riedel

--

Berufsförderungswerk Dresden