[ceph-users] mimic stability finally achieved

2019-04-11 Thread Mazzystr
I think I finally have a stable containerized Mimic cluster... jeez, it
was hard enough!

I'm currently repopulating cephfs and cruising along at ...
client:   147 MiB/s wr, 0 op/s rd, 38 op/s wr

First, last month I had four Seagate Barracuda drive failures at the same
time, with around 18,000 power_on_hours.  I used to adore Seagate.  Now I am
shocked at how terrible they are.  And they have been advertising NVMe
drives to me on Facebook very heavily.  There is no way...  Then three Newegg
and Amazon suppliers failed to get me the correct HGST disks.  I've been
dealing with this garbage for about five weeks now!

Once the hardware issues were settled, my OSDs were flapping like gophers on
the Caddyshack golf course.  Ultimately I had to copy the data to standalone
drives, destroy everything Ceph-related and start from scratch.  That
didn't solve the problem!  The OSDs continued to flap!

For new clusters, and maybe old clusters too, this setting is key!!!
ceph osd crush tunables optimal


Without that setting I surmise that client writes would consume all
available cluster bandwidth.  The MDS would report slow I/Os, OSDs would not
be able to replicate objects or answer heartbeats, and slowly they would get
knocked out of the cluster.
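
(For anyone following along: checking and applying the profile is just the
two commands below.  Note that switching tunable profiles triggers data
movement, so do it during a quiet window.)

ceph osd crush show-tunables
ceph osd crush tunables optimal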

That's all from me.  Like my poor NC sinuses recovering from the crazy
pollen, my Ceph life is also slowly recovering.

/Chris C
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-24 Thread Mazzystr
Hi Sage, thanks for chiming in.  I can't imagine how busy you are.

Sorry guys.  I reprovisioned the offending osd right after this email and a
conversation on #ceph.  I do have the output from '/usr/bin/ceph daemon
osd.5 perf dump | /usr/bin/jq .' saved.  I'll be happy to add it to the
issue tracker.
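
(In case it helps anyone else digging into this, the counters in question can
be pulled straight out of the perf dump on the OSD host; the jq filter below
is just an example.)

/usr/bin/ceph daemon osd.5 perf dump | /usr/bin/jq '.bluefs | {db_used_bytes, db_total_bytes, slow_used_bytes, slow_total_bytes}'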

/C


On Fri, Mar 22, 2019 at 7:01 PM Sage Weil  wrote:

> I have a ticket open for this:
>
> http://tracker.ceph.com/issues/38745
>
> Please comment there with the health warning you're seeing and any other
> details so we can figure out why it's happening.
>
> I wouldn't reprovision those OSDs yet, until we know why it happens.
> Also, it's likely that ceph-bluestore-tool can sort it out by
> adding/removing the db volume.
>
> Thanks!
> sage
>
>
> On Fri, 22 Mar 2019, Mazzystr wrote:
>
> > I am also seeing BlueFS spill since updating to Nautilus.  I also see
> > high slow_used_bytes and slow_total_bytes metrics.  It sure looks to me
> > that the only solution is to zap and rebuild the osd.  I had to manually
> > check 36 osds, some of them traditional processes and some containerized.
> > The lack of tooling here is underwhelming...  As soon as I rebuilt the
> > osd the "BlueFS spill..." warning went away.
> >
> > I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I
> > don't understand the spillover.
> >
> >
> > > On Fri, Feb 15, 2019 at 12:33 PM David Turner  wrote:
> >
> > > The answer is probably going to be in how big your DB partition is vs
> > > how big your HDD disk is.  From your output it looks like you have a
> > > 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used size
> > > isn't currently full, I would guess that at some point since this OSD
> > > was created that it did fill up and what you're seeing is the part of
> > > the DB that spilled over to the data disk.  This is why the official
> > > recommendation (that is quite cautious, but cautious because some use
> > > cases will use this up) for a blocks.db partition is 4% of the data
> > > drive.  For your 6TB disks that's a recommendation of 240GB per DB
> > > partition.  Of course the actual size of the DB needed is dependent on
> > > your use case.  But pretty much every use case for a 6TB disk needs a
> > > bigger partition than 28GB.
> > >
> > > On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin 
> > > wrote:
> > >
> > >> Wrong metadata paste of osd.73 in previous message.
> > >>
> > >>
> > >> {
> > >>
> > >>  "id": 73,
> > >>  "arch": "x86_64",
> > >>  "back_addr": "10.10.10.6:6804/175338",
> > >>  "back_iface": "vlan3",
> > >>  "bluefs": "1",
> > >>  "bluefs_db_access_mode": "blk",
> > >>  "bluefs_db_block_size": "4096",
> > >>  "bluefs_db_dev": "259:22",
> > >>  "bluefs_db_dev_node": "nvme2n1",
> > >>  "bluefs_db_driver": "KernelDevice",
> > >>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> > >>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
> > >>  "bluefs_db_rotational": "0",
> > >>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
> > >>  "bluefs_db_size": "30064771072",
> > >>  "bluefs_db_type": "nvme",
> > >>  "bluefs_single_shared_device": "0",
> > >>  "bluefs_slow_access_mode": "blk",
> > >>  "bluefs_slow_block_size": "4096",
> > >>  "bluefs_slow_dev": "8:176",
> > >>  "bluefs_slow_dev_node": "sdl",
> > >>  "bluefs_slow_driver": "KernelDevice",
> > >>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
> > >>  "bluefs_slow_partition_path": "/dev/sdl2",
> > >>  "bluefs_slow_rotational": "1",
> > >>  "bluefs_slow_size": "6001069199360",
> > >>  "bluefs_slow_type": "hdd",
> > >>  "bluefs_wal_access_mode": "blk&quo

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Mazzystr
inline...

On Fri, Mar 22, 2019 at 1:08 PM Konstantin Shalygin  wrote:

> On 3/22/19 11:57 PM, Mazzystr wrote:
> > I am also seeing BlueFS spill since updating to Nautilus.  I also see
> > high slow_used_bytes and slow_total_bytes metrics.  It sure looks to
> > me that the only solution is to zap and rebuild the osd.  I had to
> > manually check 36 osds, some of them traditional processes and some
> > containerized.  The lack of tooling here is underwhelming...  As soon
> > as I rebuilt the osd the "BlueFS spill..." warning went away.
> >
> > I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I
> > don't understand the spillover.
>
> Wow, it's something new. What is your upgrade path?
>
>
I keep current with the community releases.  All OSDs have been rebuilt as
of Luminous.



> Also, do you record cluster metrics, e.g. via Prometheus?  To see the diff
> between upgrades.
>
>
Unfortunately not.  I've only had Prometheus running for about two weeks,
and I had it turned off for a couple of days for some unknown reason... :/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Mazzystr
I am also seeing BlueFS spill since updating to Nautilus.  I also see high
slow_used_bytes and slow_total_bytes metrics.  It sure looks to me that the
only solution is to zap and rebuild the osd.  I had to manually check 36
osds, some of them traditional processes and some containerized.  The lack
of tooling here is underwhelming...  As soon as I rebuilt the osd the
"BlueFS spill..." warning went away.

I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I don't
understand the spillover.
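
For anyone else stuck checking OSDs by hand, something like this saves some
typing (it goes through the admin socket, so run it on each OSD host for the
OSDs local to that host):

for i in $(ceph osd ls); do
  echo -n "osd.$i: "
  ceph daemon osd.$i perf dump 2>/dev/null | jq '.bluefs | {slow_used_bytes, slow_total_bytes}'
done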


On Fri, Feb 15, 2019 at 12:33 PM David Turner  wrote:

> The answer is probably going to be in how big your DB partition is vs how
> big your HDD disk is.  From your output it looks like you have a 6TB HDD
> with a 28GB Blocks.DB partition.  Even though the DB used size isn't
> currently full, I would guess that at some point since this OSD was created
> that it did fill up and what you're seeing is the part of the DB that
> spilled over to the data disk.  This is why the official recommendation
> (that is quite cautious, but cautious because some use cases will use this
> up) for a blocks.db partition is 4% of the data drive.  For your 6TB disks
> that's a recommendation of 240GB per DB partition.  Of course the actual
> size of the DB needed is dependent on your use case.  But pretty much every
> use case for a 6TB disk needs a bigger partition than 28GB.
>
> On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin 
> wrote:
>
>> Wrong metadata paste of osd.73 in previous message.
>>
>>
>> {
>>
>>  "id": 73,
>>  "arch": "x86_64",
>>  "back_addr": "10.10.10.6:6804/175338",
>>  "back_iface": "vlan3",
>>  "bluefs": "1",
>>  "bluefs_db_access_mode": "blk",
>>  "bluefs_db_block_size": "4096",
>>  "bluefs_db_dev": "259:22",
>>  "bluefs_db_dev_node": "nvme2n1",
>>  "bluefs_db_driver": "KernelDevice",
>>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
>>  "bluefs_db_rotational": "0",
>>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
>>  "bluefs_db_size": "30064771072",
>>  "bluefs_db_type": "nvme",
>>  "bluefs_single_shared_device": "0",
>>  "bluefs_slow_access_mode": "blk",
>>  "bluefs_slow_block_size": "4096",
>>  "bluefs_slow_dev": "8:176",
>>  "bluefs_slow_dev_node": "sdl",
>>  "bluefs_slow_driver": "KernelDevice",
>>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
>>  "bluefs_slow_partition_path": "/dev/sdl2",
>>  "bluefs_slow_rotational": "1",
>>  "bluefs_slow_size": "6001069199360",
>>  "bluefs_slow_type": "hdd",
>>  "bluefs_wal_access_mode": "blk",
>>  "bluefs_wal_block_size": "4096",
>>  "bluefs_wal_dev": "259:22",
>>  "bluefs_wal_dev_node": "nvme2n1",
>>  "bluefs_wal_driver": "KernelDevice",
>>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
>>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
>>  "bluefs_wal_rotational": "0",
>>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
>>  "bluefs_wal_size": "1073741824",
>>  "bluefs_wal_type": "nvme",
>>  "bluestore_bdev_access_mode": "blk",
>>  "bluestore_bdev_block_size": "4096",
>>  "bluestore_bdev_dev": "8:176",
>>  "bluestore_bdev_dev_node": "sdl",
>>  "bluestore_bdev_driver": "KernelDevice",
>>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
>>  "bluestore_bdev_partition_path": "/dev/sdl2",
>>  "bluestore_bdev_rotational": "1",
>>  "bluestore_bdev_size": "6001069199360",
>>  "bluestore_bdev_type": "hdd",
>>  "ceph_version": "ceph version 12.2.10
>> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
>>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
>>  "default_device_class": "hdd",
>>  "distro": "centos",
>>  "distro_description": "CentOS Linux 7 (Core)",
>>  "distro_version": "7",
>>  "front_addr": "172.16.16.16:6803/175338",
>>  "front_iface": "vlan4",
>>  "hb_back_addr": "10.10.10.6:6805/175338",
>>  "hb_front_addr": "172.16.16.16:6805/175338",
>>  "hostname": "ceph-osd5",
>>  "journal_rotational": "0",
>>  "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
>>  "kernel_version": "3.10.0-862.11.6.el7.x86_64",
>>  "mem_swap_kb": "0",
>>  "mem_total_kb": "65724256",
>>  "os": "Linux",
>>  "osd_data": "/var/lib/ceph/osd/ceph-73",
>>  "osd_objectstore": "bluestore",
>>  "rotational": "1"
>> }
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests in cache tier with rep_size 2

2017-11-01 Thread Mazzystr
I disagree.

We have the following setting...
  osd pool default size = 3
  osd pool default min size = 1

There's math that needs to be done for 'osd pool default size'.  A
setting of 3 and 1 allows for 2 disks to fail ... at the same time ...
without a loss of data.  This is standard storage-admin practice.  NetApp
advertises this as protection from double disk failure.
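
Checking and adjusting this on the cache pool only takes a minute (the pool
name is a placeholder):

ceph osd pool get <cache-pool> size
ceph osd pool get <cache-pool> min_size
ceph osd pool set <cache-pool> min_size 1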

Higher replication settings need to take into account the desired usable
capacity, power, data center space, and mean time to replication.  I was at
a site that had 8,000 disks in an HDFS cluster.  We were replacing a dozen
disks weekly.

On the flip side, shutting down client access because of a disk failure in
the cluster is *unacceptable* for a product.

On Wed, Nov 1, 2017 at 10:08 AM, David Turner  wrote:

> PPS - or min_size 1 in production
>
> On Wed, Nov 1, 2017 at 10:08 AM David Turner 
> wrote:
>
>> What is your min_size in the cache pool?  If your min_size is 2, then the
>> cluster would block requests to that pool due to it having too few copies
>> available.
>>
>> PS - Please don't consider using rep_size 2 in production.
>>
>> On Wed, Nov 1, 2017 at 5:14 AM Eugen Block  wrote:
>>
>>> Hi experts,
>>>
>>> we have upgraded our cluster to Luminous successfully, no big issues
>>> so far. We are also testing cache tier with only 2 SSDs (we know it's
>>> not recommended), and there's one issue to be resolved:
>>>
>>> Every time we have to restart the cache OSDs we get slow requests with
>>> impacts on our client VMs. We never restart both SSDs at the same
>>> time, of course, so we are wondering why the slow requests? What
>>> exactly is happening with a replication size of 2? We assumed that if
>>> one SSD is restarted the data is still available on the other disk, so we
>>> didn't expect any problems. But this occurs every time we have to
>>> restart them. Can anyone explain what is going on there? We could use
>>> your expertise on this :-)
>>>
>>> Best regards,
>>> Eugen
>>>
>>> --
>>> Eugen Block voice   : +49-40-559 51 75
>>> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
>>> Postfach 61 03 15
>>> D-22423 Hamburg e-mail  : ebl...@nde.ag
>>>
>>>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>>Sitz und Registergericht: Hamburg, HRB 90934
>>>Vorstand: Jens-U. Mozdzen
>>> USt-IdNr. DE 814 013 983
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 回复: Re: [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-11-01 Thread Mazzystr
I experienced this as well in my tiny Ceph cluster testing...

HW spec - 3x
Intel i7-4770K quad core
32GB M.2 SSD
8GB memory
Dell PERC H200
6 x 3TB Seagate
CentOS 7.x
Ceph 12.x

I also run 3 memory-hungry procs on the Ceph nodes.  Obviously there is a
memory problem here.  Here are the steps I took to avoid the oom-killer
killing the node ...

/etc/rc.local -
for i in $(pgrep ceph-mon); do echo -17 > /proc/$i/oom_score_adj; done
for i in $(pgrep ceph-osd); do echo -17 > /proc/$i/oom_score_adj; done
for i in $(pgrep ceph-mgr); do echo 50 > /proc/$i/oom_score_adj; done

/etc/sysctl.conf -
vm.swappiness = 100
vm.vfs_cache_pressure = 1000
vm.min_free_kbytes = 512

/etc/ceph/ceph.conf -
[osd]
bluestore_cache_size = 52428800
bluestore_cache_size_hdd = 52428800
bluestore_cache_size_ssd = 52428800
bluestore_cache_kv_max = 52428800
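
(For reference, 52428800 = 50 * 1024 * 1024, i.e. a 50 MiB BlueStore cache
per OSD.)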

You're going to see memory page-{in,out} skyrocket with this setup but it
should keep oom-killer at bay until a memory fix can be applied.  Client
performance to the cluster wasn't spectacular but wasn't terrible.  I was
seeing +/- 60Mb/sec of bandwidth.

Ultimately I upgraded the nodes to 16GB of memory.

/Chris C

On Tue, Oct 31, 2017 at 10:30 PM, shadow_lin  wrote:

> Hi Sage,
> We have tried compiled the latest ceph source code from github.
> The build is ceph version 12.2.1-249-g42172a4 (
> 42172a443183ffe6b36e85770e53fe678db293bf) luminous (stable).
> The memory problem seems better, but the memory usage of the osd still
> keeps increasing as more data is written into the rbd image, and the
> memory usage won't drop after the write is stopped.
> Could you specify in which commit the memory bug was fixed?
> Thanks
> 2017-11-01
> --
> lin.yunfan
> --
>
> *From:* Sage Weil 
> *Sent:* 2017-10-24 20:03
> *Subject:* Re: [ceph-users] [luminous]OSD memory usage increase when writing a
> lot of data to cluster
> *To:* "shadow_lin"
> *Cc:* "ceph-users"
>
> On Tue, 24 Oct 2017, shadow_lin wrote:
> > Hi All,
> > The cluster has 24 osds with 24 8TB hdds.
> > Each osd server has 2GB ram and runs 2 OSDs with 2 8TB HDDs.  I know the
> > memory is below the recommended value, but this osd server is an ARM
> > server so I can't do anything to add more ram.
> > I created a replicated (2 rep) pool and a 20TB image and mounted it to
> > the test server with an xfs fs.
> >
> > I have set the ceph.conf to this (according to what other related posts
> > suggested):
> > [osd]
> > bluestore_cache_size = 104857600
> > bluestore_cache_size_hdd = 104857600
> > bluestore_cache_size_ssd = 104857600
> > bluestore_cache_kv_max = 103809024
> >
> > osd map cache size = 20
> > osd map max advance = 10
> > osd map share max epochs = 10
> > osd pg epoch persisted max stale = 10
> >
> > The bluestore cache setting did improve the situation, but if I try to
> > write 1TB of data by dd command (dd if=/dev/zero of=test bs=1G
> > count=1000) to the rbd, the osd will eventually be killed by the oom
> > killer.
> > If I only write like 100G of data at once then everything is fine.
> >
> > Why does the osd memory usage keep increasing while writing?
> > Is there anything I can do to reduce the memory usage?
>
> There is a bluestore memory bug that was fixed just after 12.2.1 was
> released; it will be fixed in 12.2.2.  In the meantime, you can
> consider running the latest luminous branch (not fully tested) from
> https://shaman.ceph.com/builds/ceph/luminous.
>
> sage
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recover ceph fs files

2017-10-31 Thread Mazzystr
I've been experimenting with moving Ceph from single-node to multi-node on
the tiniest x86 boxes that money buys.

Tell me if you've heard this one before.  :)  Over the weekend I
unexpectedly lost my root file system and, along with it, my single mon.  I
reinstalled the OS, reinstalled Ceph, recovered the Ceph cluster and I'm
now in a HEALTH_OK state.  rados ls shows all my objects.

I'm now trying to recover the CephFS files.  I have run through
cephfs-data-scan { scan_extents, scan_inodes } twice and still see no POSIX
files when the fs mounts up.  I'm thinking of just rm'ing the metadata pool,
recreating it and running the cephfs-data-scan's again.
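
(For reference, the scan sequence as documented for this release is roughly
the following; the data pool name is a placeholder.)

cephfs-data-scan init --force-init
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>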

Does anyone have any suggestions?

Thanks,
/Chris C
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Small-cluster performance issues

2017-08-22 Thread Mazzystr
Also examine your network layout.  Any saturation in the private cluster
network or the client-facing network will be felt by clients / libvirt /
virtual machines.

As OSD count increases...

   - Ensure client network / private cluster network separation - different
   nics, different wires, different switches (see the ceph.conf sketch below)
   - Add more nics, both client side and private cluster network side, and
   lag (bond) them.
   - If/When your dept's budget suddenly swells... implement 10 gig-e.
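
A minimal sketch of that split in ceph.conf (the subnets here are just
examples):

[global]
# client-facing traffic
public network = 192.168.1.0/24
# OSD replication and heartbeat traffic
cluster network = 10.10.10.0/24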


Monitor, capacity plan, execute  :)

/Chris C

On Tue, Aug 22, 2017 at 3:02 PM, Maged Mokhtar  wrote:

>
>
> It is likely your 2 spinning disks cannot keep up with the load. Things
> are likely to improve if you double your OSDs hooking them up to your
> existing SSD journal. Technically it would be nice to run a
> load/performance tool (either atop/collectl/sysstat) and measure how busy
> your resources are, but it is most likely your 2 spinning disks will show
> near 100% busy utilization.
>
> filestore_max_sync_interval: I do not recommend decreasing this to 0.1; I
> would keep it at 5 sec.
>
> osd_op_threads: do not increase this unless you have enough cores.
>
> but adding disks is the way to go
>
> Maged
>
>
>
> On 2017-08-22 20:08, fcid wrote:
>
> Hello everyone,
>
> I've been using ceph to provide storage using RBD for 60 KVM virtual
> machines running on proxmox.
>
> The ceph cluster we have is very small (2 OSDs + 1 mon per node, and a
> total of 3 nodes) and we are having some performance issues, like big
> latency times (apply lat: ~0.5 s; commit lat: 0.001 s), which get worse
> with the weekly deep-scrubs.
>
> I wonder if doubling the number of OSDs would improve latency times, or
> if there is any other configuration tweak recommended for such a small
> cluster.  Also, I'm looking forward to reading about the experience of
> other users with a similar configuration.
>
> Some technical info:
>
>   - Ceph version: 10.2.5
>
>   - OSDs have SSD journal (one SSD disk per 2 OSDs) and have a spindle for
> backend disk.
>
>   - Using CFQ disk queue scheduler
>
>   - OSD configuration excerpt:
>
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 63
> osd_client_op_priority = 1
> osd_mkfs_options = -f -i size=2048 -n size=64k
> osd_mount_options_xfs = inode64,noatime,logbsize=256k
> osd_journal_size = 20480
> osd_op_threads = 12
> osd_disk_threads = 1
> osd_disk_thread_ioprio_class = idle
> osd_disk_thread_ioprio_priority = 7
> osd_scrub_begin_hour = 3
> osd_scrub_end_hour = 8
> osd_scrub_during_recovery = false
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> filestore_xattr_use_omap = true
> filestore_queue_max_ops = 2500
> filestore_min_sync_interval = 0.01
> filestore_max_sync_interval = 0.1
> filestore_journal_writeahead = true
>
> Best regards,
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] LevelDB corruption

2017-08-01 Thread Mazzystr
Sorry to take so long in replying.

I ended up evacuating the data and rebuilding using Luminous with BlueStore
OSDs.  I still need to do my usual drive/host failure testing before going
live.  Of course other things are burning right now and have my attention.
Hopefully I can finish that work in the next few days.

On Wed, Jun 28, 2017 at 10:11 PM, Mazzystr <mazzy...@gmail.com> wrote:

> I should be able to try that tomorrow.
>
> I'll report back in afterward
>
> On Wed, Jun 28, 2017 at 10:09 PM, Brad Hubbard <bhubb...@redhat.com>
> wrote:
>
>> On Thu, Jun 29, 2017 at 11:58 AM, Mazzystr <mazzy...@gmail.com> wrote:
>> > just one MON
>>
>> Try just replacing that MON then?
>>
>> >
>> > On Wed, Jun 28, 2017 at 8:05 PM, Brad Hubbard <bhubb...@redhat.com>
>> wrote:
>> >>
>> >> On Wed, Jun 28, 2017 at 10:18 PM, Mazzystr <mazzy...@gmail.com> wrote:
>> >> > The corruption is back in mons logs...
>> >> >
>> >> > 2017-06-28 08:16:53.078495 7f1a0b9da700  1 leveldb: Compaction error:
>> >> > Corruption: bad entry in block
>> >> > 2017-06-28 08:16:53.078499 7f1a0b9da700  1 leveldb: Waiting after
>> >> > background
>> >> > compaction error: Corruption: bad entry in block
>> >>
>> >> Is this just one MON, or is it in the logs of all of your MONs?
>> >>
>> >> >
>> >> >
>> >> > On Tue, Jun 27, 2017 at 10:42 PM, Mazzystr <mazzy...@gmail.com>
>> wrote:
>> >> >>
>> >> >> 22:16 ccallegar: good grief...talk about a handful of sand in your
>> eye!
>> >> >> I've been chasing down a "leveldb: Compaction error: Corruption: bad
>> >> >> entry
>> >> >> in block " in mons logs...
>> >> >> 22:17 ccallegar: I ran a python leveldb.repair() and restarted osd's
>> >> >> and
>> >> >> mons and my cluster crashed and burned
>> >> >> 22:18 ccallegar: a couple files ended up in leveldb lost dirs.  The
>> >> >> path
>> >> >> is different if it's a mons or osd
>> >> >> 22:19 ccallegar: for mons logs showed a MANIFEST file missing.  I
>> moved
>> >> >> the file that landed in lost back to normal position, chown'd
>> >> >> ceph:ceph,
>> >> >> restarted mons and mons came back online!
>> >> >> 22:21 ccallegar: osd logs showed a sst file missing.  looks like
>> >> >> leveldb.repair() does the needful but names the new file a .ldb.  I
>> >> >> renamed
>> >> >> the file, chown'd ceph:ceph, restarted osd and they came back
>> online!
>> >> >>
>> >> >> leveldb corruption log entries have gone away and my cluster is
>> >> >> recovering
>> >> >> it's way to happiness.
>> >> >>
>> >> >> Hopefully this helps someone else out
>> >> >>
>> >> >> Thanks,
>> >> >> /Chris
>> >> >>
>> >> >>
>> >> >> On Tue, Jun 27, 2017 at 6:39 PM, Mazzystr <mazzy...@gmail.com>
>> wrote:
>> >> >>>
>> >> >>> Hi Ceph Users,
>> >> >>> I've been chasing down some levelDB corruption messages in my mons
>> >> >>> logs.
>> >> >>> I ran a python leveldb repair on mon and odd leveldbs.  The job
>> caused
>> >> >>> a
>> >> >>> files to disappear and a log file to appear in lost directory.  Mon
>> >> >>> and
>> >> >>> osd's refuse to boot.
>> >> >>>
>> >> >>> Ceph version is kraken 11.02.
>> >> >>>
>> >> >>> There's not a whole lot of info on the internet regarding this.
>> >> >>> Anyone
>> >> >>> have any ideas on how to recover the mess?
>> >> >>>
>> >> >>> Thanks,
>> >> >>> /Chris C
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Cheers,
>> >> Brad
>> >
>> >
>>
>>
>>
>> --
>> Cheers,
>> Brad
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs metadata damage and scrub error

2017-07-18 Thread Mazzystr
Any update on this?  I also have the same problem.

# for i in $(cat pg_dump | grep 'stale+active+clean' | awk {'print $1'});
do echo -n "$i: "; rados list-inconsistent-obj $i; echo; done
107.ff: {"epoch":10762,"inconsistents":[]}
.
and so on for 49 pg's that I think I had a problem with

# ceph tell mds.ceph damage ls | python -m "json.tool"
2017-07-18 16:28:08.657673 7f766e629700  0 client.1923797 ms_handle_reset
on 192.168.1.10:6800/1268574779
2017-07-18 16:28:08.665693 7f76577fe700  0 client.1923798 ms_handle_reset
on 192.168.1.10:6800/1268574779
[
{
"damage_type": "dir_frag",
"frag": "*",
"id": 4153356868,
"ino": 1099511661266
}
]

# cat dirs_ceph_inodes | grep 1099511661266
1099511661266 drwxr-xr-x 1 chris chris 1 May  4  2017 /mnt/ceph/2017/05/04/

# rm -rf /mnt/ceph/2017/05/04/
rm: cannot remove ‘/mnt/ceph/2017/05/04/’: Directory not empty

# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)

All osd's are on 11.2.0


I am A-OK with losing the directory and its contents... I just need to get
to happy pastures.
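
If it comes to it, I assume I can just clear the damage table entry and retry
the rm (this only clears the warning, it doesn't repair anything):

ceph tell mds.ceph damage rm 4153356868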

Thanks,
/Chris Callegari



On Tue, May 30, 2017 at 7:42 AM, James Eckersall 
wrote:

> Further to this, we managed to repair the inconsistent PG by comparing the
> object digests and removing the one that didn't match (3 of 4 replicas had
> the same digest, 1 didn't) and then issuing a pg repair and scrub.
> This has removed the inconsistent flag on the PG, however, we are still
> seeing the mds report damage.
>
> We tried removing the damage from the mds with damage rm, then ran a
> recursive stat across the problem directory, but the damage re-appeared.
> Tried dong a scrub_path, but the command returned code -2 and the mds log
> shows that the scrub started and finished less than 1ms later.
>
> Any further help is greatly appreciated.
>
> On 17 May 2017 at 10:58, James Eckersall 
> wrote:
>
>> An update to this.  The cluster has been upgraded to Kraken, but I've
>> still got the same PG reporting inconsistent and the same error message
>> about mds metadata damaged.
>> Can anyone offer any further advice please?
>> If you need output from the ceph-osdomap-tool, could you please explain
>> how to use it?  I haven't been able to find any docs that explain.
>>
>> Thanks
>> J
>>
>> On 3 May 2017 at 14:35, James Eckersall 
>> wrote:
>>
>>> Hi David,
>>>
>>> Thanks for the reply, it's appreciated.
>>> We're going to upgrade the cluster to Kraken and see if that fixes the
>>> metadata issue.
>>>
>>> J
>>>
>>> On 2 May 2017 at 17:00, David Zafman  wrote:
>>>

 James,

 You have an omap corruption.  It is likely caused by a bug which
 has already been identified.  A fix for that problem is available but it is
 still pending backport for the next Jewel point release.  All 4 of your
 replicas have different "omap_digest" values.

 Instead of the xattrs the ceph-osdomap-tool --command
 dump-objects-with-keys output from OSDs 3, 10, 11, 23 would be interesting
 to compare.

 ***WARNING*** Please backup your data before doing any repair attempts.

 If you can upgrade to Kraken v11.2.0, it will auto repair the omaps on
 ceph-osd start up.  It will likely still require a ceph pg repair to make
 the 4 replicas consistent with each other.  The final result may be the
 reappearance of removed MDS files in the directory.

 If you can recover the data, you could remove the directory entirely
 and rebuild it.  The original bug was triggered during omap deletion
 typically in a large directory which corresponds to an individual unlink in
 cephfs.

 If you can build a branch in github to get the newer ceph-osdomap-tool
 you could try to use it to repair the omaps.

 David


 On 5/2/17 5:05 AM, James Eckersall wrote:

 Hi,

 I'm having some issues with a ceph cluster.  It's an 8 node cluster running
 Jewel ceph-10.2.7-0.el7.x86_64 on CentOS 7.
 This cluster provides RBDs and a CephFS filesystem to a number of clients.

 ceph health detail is showing the following errors:

 pg 2.9 is active+clean+inconsistent, acting [3,10,11,23]
 1 scrub errors
 mds0: Metadata damage detected


 The pg 2.9 is in the cephfs_metadata pool (id 2).

 I've looked at the OSD logs for OSD 3, which is the primary for this PG,
 but the only thing that appears relating to this PG is the following:

 log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors

 After initiating a ceph pg repair 2.9, I see the following in the primary
 OSD log:

 log_channel(cluster) log [ERR] : 2.9 repair 1 errors, 0 fixed
 log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors


 I found the below command in a previous ceph-users post.  Running this
 

Re: [ceph-users] LevelDB corruption

2017-06-28 Thread Mazzystr
just one MON

On Wed, Jun 28, 2017 at 8:05 PM, Brad Hubbard <bhubb...@redhat.com> wrote:

> On Wed, Jun 28, 2017 at 10:18 PM, Mazzystr <mazzy...@gmail.com> wrote:
> > The corruption is back in mons logs...
> >
> > 2017-06-28 08:16:53.078495 7f1a0b9da700  1 leveldb: Compaction error:
> > Corruption: bad entry in block
> > 2017-06-28 08:16:53.078499 7f1a0b9da700  1 leveldb: Waiting after
> background
> > compaction error: Corruption: bad entry in block
>
> Is this just one MON, or is it in the logs of all of your MONs?
>
> >
> >
> > On Tue, Jun 27, 2017 at 10:42 PM, Mazzystr <mazzy...@gmail.com> wrote:
> >>
> >> 22:16 ccallegar: good grief...talk about a handful of sand in your eye!
> >> I've been chasing down a "leveldb: Compaction error: Corruption: bad
> entry
> >> in block " in mons logs...
> >> 22:17 ccallegar: I ran a python leveldb.repair() and restarted osd's and
> >> mons and my cluster crashed and burned
> >> 22:18 ccallegar: a couple files ended up in leveldb lost dirs.  The path
> >> is different if it's a mons or osd
> >> 22:19 ccallegar: for mons logs showed a MANIFEST file missing.  I moved
> >> the file that landed in lost back to normal position, chown'd ceph:ceph,
> >> restarted mons and mons came back online!
> >> 22:21 ccallegar: osd logs showed a sst file missing.  looks like
> >> leveldb.repair() does the needful but names the new file a .ldb.  I
> renamed
> >> the file, chown'd ceph:ceph, restarted osd and they came back online!
> >>
> >> leveldb corruption log entries have gone away and my cluster is
> recovering
> >> it's way to happiness.
> >>
> >> Hopefully this helps someone else out
> >>
> >> Thanks,
> >> /Chris
> >>
> >>
> >> On Tue, Jun 27, 2017 at 6:39 PM, Mazzystr <mazzy...@gmail.com> wrote:
> >>>
> >>> Hi Ceph Users,
> >>> I've been chasing down some levelDB corruption messages in my mons
> logs.
> >>> I ran a python leveldb repair on mon and odd leveldbs.  The job caused
> a
> >>> files to disappear and a log file to appear in lost directory.  Mon and
> >>> osd's refuse to boot.
> >>>
> >>> Ceph version is kraken 11.02.
> >>>
> >>> There's not a whole lot of info on the internet regarding this.  Anyone
> >>> have any ideas on how to recover the mess?
> >>>
> >>> Thanks,
> >>> /Chris C
> >>
> >>
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Cheers,
> Brad
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck unclean after removing OSDs

2017-06-28 Thread Mazzystr
I've been using this procedure to remove OSDs...

OSD_ID=
ceph auth del osd.${OSD_ID}
ceph osd down ${OSD_ID}
ceph osd out ${OSD_ID}
ceph osd rm ${OSD_ID}
ceph osd crush remove osd.${OSD_ID}
systemctl disable ceph-osd@${OSD_ID}.service
systemctl stop ceph-osd@${OSD_ID}.service
sed -i "/ceph-$OSD_ID/d" /etc/fstab
umount /var/lib/ceph/osd/ceph-${OSD_ID}

Would you say this is the correct order of events?

Thanks!


On Wed, Jun 28, 2017 at 9:34 AM, David Turner  wrote:

> A couple things.  You didn't `ceph osd crush remove osd.21` after doing
> the other bits.  Also you will want to remove the bucket (re: host) from
> the crush map as it will now be empty.  Right now you have a host in the
> crush map with a weight, but no osds to put that data on.  It has a weight
> because of the 2 OSDs that are still in it that were removed from the
> cluster but not from the crush map.  It's confusing to your cluster.
>
> If you had removed the OSDs from the crush map when you ran the other
> commands, then the dead host would have still been in the crush map but
> with a weight of 0 and wouldn't cause any problems.
>
> On Wed, Jun 28, 2017 at 4:15 AM Jan Kasprzak  wrote:
>
>> Hello,
>>
>> TL;DR: what to do when my cluster reports stuck unclean pgs?
>>
>> Detailed description:
>>
>> One of the nodes in my cluster died. CEPH correctly rebalanced itself,
>> and reached the HEALTH_OK state. I have looked at the failed server,
>> and decided to take it out of the cluster permanently, because the
>> hardware
>> is indeed faulty. It used to host two OSDs, which were marked down and out
>> in "ceph osd dump".
>>
>> So from the HEALTH_OK I ran the following commands:
>>
>> # ceph auth del osd.20
>> # ceph auth del osd.21
>> # ceph osd rm osd.20
>> # ceph osd rm osd.21
>>
>> After that, CEPH started to rebalance itself, but now it reports some PGs
>> as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":
>>
>> # ceph -s
>> cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
>>  health HEALTH_WARN
>> 350 pgs stuck unclean
>> recovery 26/1596390 objects degraded (0.002%)
>> recovery 58772/1596390 objects misplaced (3.682%)
>>  monmap e16: 3 mons at {...}
>> election epoch 584, quorum 0,1,2 ...
>>  osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
>> flags require_jewel_osds
>>   pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
>> 6244 GB used, 40569 GB / 46814 GB avail
>> 26/1596390 objects degraded (0.002%)
>> 58772/1596390 objects misplaced (3.682%)
>> 3426 active+clean
>>  349 active+remapped
>>1 active
>>   client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr
>>
>> # ceph health detail
>> HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded
>> (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
>> pg 28.fa is stuck unclean for 14408925.966824, current state
>> active+remapped, last acting [38,52,4]
>> pg 28.e7 is stuck unclean for 14408925.966886, current state
>> active+remapped, last acting [29,42,22]
>> pg 23.dc is stuck unclean for 61698.641750, current state
>> active+remapped, last acting [50,33,23]
>> pg 23.d9 is stuck unclean for 61223.093284, current state
>> active+remapped, last acting [54,31,23]
>> pg 28.df is stuck unclean for 14408925.967120, current state
>> active+remapped, last acting [33,7,15]
>> pg 34.38 is stuck unclean for 60904.322881, current state
>> active+remapped, last acting [18,41,9]
>> pg 34.fe is stuck unclean for 60904.241762, current state
>> active+remapped, last acting [58,1,44]
>> [...]
>> pg 28.8f is stuck unclean for 66102.059671, current state active, last
>> acting [8,40,5]
>> [...]
>> recovery 26/1596390 objects degraded (0.002%)
>> recovery 58772/1596390 objects misplaced (3.682%)
>>
>> Apart from that, the data stored in CEPH pools seems to be reachable
>> and usable as before.
>>
>> The nodes run CentOS 7 and ceph 10.2.5 (RPMS downloaded from CEPH
>> repository).
>>
>> What other debugging info should I provide, or what to do in order
>> to unstuck the stuck pgs? Thanks!
>>
>> -Yenya
>>
>> --
>> | Jan "Yenya" Kasprzak > private}> |
>> | http://www.fi.muni.cz/~kas/ GPG:
>> 4096R/A45477D5 |
>> > That's why this kind of vulnerability is a concern: deploying stuff is
>> <
>> > often about collecting an obscene number of .jar files and pushing them
>> <
>> > up to the application server.  --pboddie at LWN
>> <
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

Re: [ceph-users] LevelDB corruption

2017-06-28 Thread Mazzystr
The corruption is back in mons logs...

2017-06-28 08:16:53.078495 7f1a0b9da700  1 leveldb: Compaction error:
Corruption: bad entry in block
2017-06-28 08:16:53.078499 7f1a0b9da700  1 leveldb: Waiting after
background compaction error: Corruption: bad entry in block


On Tue, Jun 27, 2017 at 10:42 PM, Mazzystr <mazzy...@gmail.com> wrote:

> 22:16 ccallegar: good grief...talk about a handful of sand in your eye!
> I've been chasing down a "leveldb: Compaction error: Corruption: bad entry
> in block " in mons logs...
> 22:17 ccallegar: I ran a python leveldb.repair() and restarted osd's and
> mons and my cluster crashed and burned
> 22:18 ccallegar: a couple files ended up in leveldb lost dirs.  The path
> is different if it's a mons or osd
> 22:19 ccallegar: for mons logs showed a MANIFEST file missing.  I moved
> the file that landed in lost back to normal position, chown'd ceph:ceph,
> restarted mons and mons came back online!
> 22:21 ccallegar: osd logs showed a sst file missing.  looks like
> leveldb.repair() does the needful but names the new file a .ldb.  I renamed
> the file, chown'd ceph:ceph, restarted osd and they came back online!
>
> leveldb corruption log entries have gone away and my cluster is recovering
> it's way to happiness.
>
> Hopefully this helps someone else out
>
> Thanks,
> /Chris
>
>
> On Tue, Jun 27, 2017 at 6:39 PM, Mazzystr <mazzy...@gmail.com> wrote:
>
>> Hi Ceph Users,
>> I've been chasing down some levelDB corruption messages in my mons logs.
>> I ran a python leveldb repair on mon and odd leveldbs.  The job caused a
>> files to disappear and a log file to appear in lost directory.  Mon and
>> osd's refuse to boot.
>>
>> Ceph version is kraken 11.02.
>>
>> There's not a whole lot of info on the internet regarding this.  Anyone
>> have any ideas on how to recover the mess?
>>
>> Thanks,
>> /Chris C
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] LevelDB corruption

2017-06-27 Thread Mazzystr
Hi Ceph Users,
I've been chasing down some LevelDB corruption messages in my mon logs.  I
ran a python leveldb repair on the mon and osd leveldbs.  The job caused
some files to disappear and a log file to appear in the lost directory.  The
mon and osd's now refuse to boot.
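
(For the record, the repair call was something along these lines, using the
py-leveldb bindings; the store path is just an example.)

python -c "import leveldb; leveldb.RepairDB('/var/lib/ceph/mon/ceph-a/store.db')"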

Ceph version is Kraken 11.2.0.

There's not a whole lot of info on the internet regarding this.  Anyone
have any ideas on how to recover the mess?

Thanks,
/Chris C
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs-data-scan pg_files missing

2017-06-20 Thread Mazzystr
I'm on Red Hat Storage 2.2 (ceph-10.2.7-0.el7.x86_64) and I see this...
# cephfs-data-scan
Usage:
  cephfs-data-scan init [--force-init]
  cephfs-data-scan scan_extents [--force-pool] <data pool>
  cephfs-data-scan scan_inodes [--force-pool] [--force-corrupt] <data pool>

--force-corrupt: overrite apparently corrupt structures
--force-init: write root inodes even if they exist
--force-pool: use data pool even if it is not in FSMap

  cephfs-data-scan scan_frags [--force-corrupt]

  cephfs-data-scan tmap_upgrade <metadata pool>

  --conf/-c FILE    read configuration from the given configuration file
  --id/-i ID        set ID portion of my name
  --name/-n TYPE.ID set name
  --cluster NAME    set cluster name (default: ceph)
  --setuser USER    set uid to user or uid (and gid to user's gid)
  --setgroup GROUP  set gid to group or gid
  --version show version and quit


Anyone know where "cephfs-data-scan pg_files <path> <pg id> [...]"
went per the docs?

Thanks,
/Chris Callegari
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] disk mishap + bad disk and xfs corruption = stuck PG's

2017-06-16 Thread Mazzystr
I resolved this situation and the results are a testament to how amazing
Ceph is.

I had the replication factor set to 2 during my data migration to ensure
there was enough capacity on the target as I cycled disks from Drobo storage
to Ceph storage.  I forgot about that setting.  I started having mechanical
trouble and the inevitable failure of a disk, so I started evacuating
objects from that disk via a crush reweight of the osd.  Objects started to
fly across osd's.  A bunch of data was moved and some was left to be moved.
Impatient me started tinkering on some other part of the cluster and I
accidentally "zapped" a disk.  Things were still "ok".  Then the failing
disk finally died and dropped out of the filesystem.  I rebooted the node.
The xfs filesystem corrupted.  xfs_repair wiped out inodes and put
everything into lost+found.  This was a BIG uh-oh.  Ceph reacted badly, with
46 pg's in active+stuck+stale status, and the filesystem halted.  About
800,000 objects were handled by those pg's.  Mother eff!

I ended up dropping the two disks completely from the cluster and dropping
the pg's that were managing the objects.  As for the filesystem, I simply
deleted it.

Dum dum dum!  On a traditional storage system these acts would've been
catastrophic to the "volume"!

But not for Ceph.

I re-added the good disk to the cluster and added a replacement disk to the
cluster.  I recreated the pg's as blank pg's.  This got the Ceph cluster
healthy.  For the filesystem I rescanned for inodes, which rebuilt the
object-to-inode map and recreated the inodes.  That made for a healthy POSIX
filesystem, and I mounted it up.
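
Roughly, from memory, the commands involved were along these lines (the pg
id and pool name are placeholders):

ceph pg force_create_pg <pg.id>            # repeated for each stuck pg
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>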

Total loss was a few thousand minor fringe files.  My video surveillance
archive is 100% intact.

Thanks!
/Chris Callegari

ps... post mortem actions: my pool size got set to 3 since I now have raw
capacity to do so.  ;-)


On Fri, Jun 9, 2017 at 5:11 PM, Mazzystr <mazzy...@gmail.com> wrote:

> Well I did bad I just don't know how bad yet.  Before we get into it my
> critical data is backed up to CrashPlan.  I'd rather not lose all my
> archive data.  Losing some of the data is ok.
>
> I added a bunch of disks to my ceph cluster so I turned off the cluster
> and dd'd the raw disks around so that the disks and osd's were ordered by
> id's on the HBA.  I fat fingered one disk and overwrote it.  Another disk
> didn't dd correctly... it seems to have not unmounted correctly plus it has
> some failures according to smartctl.  A repair_xfs command put a whole
> bunch of data into lost+found.
>
> I brought the cluster up and let it settle down.  The result is 49 stuck
> pg's and CephFS is halted.
>
> ceph -s is here <https://pastebin.com/RXSinLjZ>
> ceph osd tree is here <https://pastebin.com/qmE0dhyH>
> ceph pg dump minus the active pg's is here <https://pastebin.com/36kpmA8s>
>
> OSD-2 is gone with no chance to restore it.
>
> OSD-3 had the xfs corruption.  I have a bunch of
> /var/lib/ceph/osd/ceph-3/lost+found/blah/DIR_[0-9]+/blah.blah__head_blah.blah
> files after xfs_repair.  I for looped these files through ceph osd map
>  $file and it seems they have all been replicated to other OSD's.
> It seems to be safe to delete this data.
>
> There are files named [0-9]+ in the top level of
> /var/lib/ceph/osd/ceph-3/lost+found.  I don't know what to do with these
> files.
>
>
> I have a couple questions:
> 1) can the top level lost+found files be used to recreate the stuck pg's?
>
> 2a) can the pg's be dropped and recreated to bring the cluster to a
> healthy state?
> 2b) if i do this can CephFS be restored with just partial data loss?  The
> cephfs documentation isn't quite clear on how to do this.
>
> Thanks for your time and help!
> /Chris
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HA Filesystem mode (MON, OSD, MDS) with Ceph and HAof MDS daemon.

2017-06-12 Thread Mazzystr
Since your app is an Apache / PHP app, is it possible for you to reconfigure
the app to use an S3 module rather than POSIX file opens?  Then, with Ceph,
drop CephFS and configure the Civetweb S3 gateway?  You can have
"active-active" endpoints with round-robin DNS or an F5 or something.  You
would also have to repopulate the objects into rados pools.

Also increase that size parameter to 3.  ;-)

Lots of work for active-active, but the whole stack will be much more
resilient.  This is coming from someone with a ClearCase / NFS /
stale-file-handles-up-the-wazoo background.



On Mon, Jun 12, 2017 at 10:41 AM, Daniel Carrasco 
wrote:

> 2017-06-12 16:10 GMT+02:00 David Turner :
>
>> I have an incredibly light-weight cephfs configuration.  I set up an MDS
>> on each mon (3 total), and have 9TB of data in cephfs.  This data only has
>> 1 client that reads a few files at a time.  I haven't noticed any downtime
>> when it fails over to a standby MDS.  So it definitely depends on your
>> workload as to how a failover will affect your environment.
>>
>> On Mon, Jun 12, 2017 at 9:59 AM John Petrini 
>> wrote:
>>
>>> We use the following in our ceph.conf for MDS failover. We're running
>>> one active and one standby. Last time it failed over there was about 2
>>> minutes of downtime before the mounts started responding again but it did
>>> recover gracefully.
>>>
>>> [mds]
>>> max_mds = 1
>>> mds_standby_for_rank = 0
>>> mds_standby_replay = true
>>>
>>> ___
>>>
>>> John Petrini
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
> Thanks to both.
> Just now I'm working on that because I need a very fast failover.  For now
> the tests give me a very fast response when an OSD fails (about 5 seconds),
> but a very slow response when the main MDS fails (I've not timed it exactly,
> but it was not working for a long time).  Maybe that was because I created
> the other MDS after mounting, because I've done some tests just before
> sending this email and now it looks very fast (I've not noticed the
> downtime).
>
> Greetings!!
>
>
> --
> _
>
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com
> _
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] disk mishap + bad disk and xfs corruption = stuck PG's

2017-06-09 Thread Mazzystr
Well, I did bad; I just don't know how bad yet.  Before we get into it, my
critical data is backed up to CrashPlan.  I'd rather not lose all my
archive data.  Losing some of the data is ok.

I added a bunch of disks to my ceph cluster, so I turned off the cluster and
dd'd the raw disks around so that the disks and osd's were ordered by ids
on the HBA.  I fat-fingered one disk and overwrote it.  Another disk didn't
dd correctly... it seems to have not unmounted correctly, plus it has some
failures according to smartctl.  An xfs_repair command put a whole bunch of
data into lost+found.

I brought the cluster up and let it settle down.  The result is 49 stuck
pg's and CephFS is halted.

ceph -s is here <https://pastebin.com/RXSinLjZ>
ceph osd tree is here <https://pastebin.com/qmE0dhyH>
ceph pg dump minus the active pg's is here <https://pastebin.com/36kpmA8s>

OSD-2 is gone with no chance to restore it.

OSD-3 had the xfs corruption.  I have a bunch of
/var/lib/ceph/osd/ceph-3/lost+found/blah/DIR_[0-9]+/blah.blah__head_blah.blah
files after xfs_repair.  I for-looped these files through 'ceph osd map
<pool> $file' and it seems they have all been replicated to other OSD's.
It seems to be safe to delete this data.

There are files named [0-9]+ in the top level of
/var/lib/ceph/osd/ceph-3/lost+found.  I don't know what to do with these
files.


I have a couple questions:
1) can the top level lost+found files be used to recreate the stuck pg's?

2a) can the pg's be dropped and recreated to bring the cluster to a healthy
state?
2b) if i do this can CephFS be restored with just partial data loss?  The
cephfs documentation isn't quite clear on how to do this.

Thanks for your time and help!
/Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com