Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2022-01-10 Thread Bernhard Turmann

Hello Val, hello Ard,
I am not sure, but the issue might be fixed. There is an interesting comment in
the upstream changelog [1] of Ceph Pacific v16.2.5:

A long-standing bug that prevented 32-bit and 64-bit client/server 
interoperability under msgr v2 has been fixed. In particular, mixing armv7l 
(armhf) and x86_64 or aarch64 servers in the same cluster now works.
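
The class of bug described there boils down to on-wire fields whose width
depends on the build architecture: a field typed size_t is 4 bytes on armhf
but 8 bytes on amd64, so the two sides disagree about the frame layout. A
minimal sketch of the portable alternative (assumed and simplified for
illustration, not Ceph's actual headers):

  #include <cstdint>
  #include <cstdio>

  // Broken on the wire: size_t is 4 bytes on armhf and 8 bytes on
  // amd64, so a 32-bit peer and a 64-bit peer parse the same bytes
  // differently.
  struct FrameHeaderBroken {
      size_t payload_len;
  };

  // Portable: an explicit fixed-width type is the same size everywhere.
  struct FrameHeaderPortable {
      uint32_t payload_len;
  };

  static_assert(sizeof(FrameHeaderPortable) == 4,
                "an on-wire header must have one size on every arch");

  int main() {
      // Prints "broken: 8, portable: 4" on amd64 and
      // "broken: 4, portable: 4" on armhf.
      std::printf("broken: %zu, portable: %zu\n",
                  sizeof(FrameHeaderBroken), sizeof(FrameHeaderPortable));
      return 0;
  }

(Endianness and padding matter too in a real wire format, but a width
mismatch alone is enough to break interoperability.)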


Ceph version 16.2.7 is now in sid/unstable, and it has also passed the
autopkgtest for armhf.

Many thanks to Thomas, Bernd, et al. for their work; much appreciated.

[1]: https://ceph.com/en/news/blog/2021/v16-2-5-pacific-released/

Best Regards
Berni



Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2020-06-18 Thread Bernd Zeimetz
Hi Ard,

On 6/18/20 1:28 PM, Ard van Breemen wrote:
>> The biggest issue in maintaining ceph is to make it build on 32 bit
>> architectures. This seems not to be supported at all by upstream anymore.
> 
> First of all, I don't know what your goal is to support 32 bit.

Debian supports it, so it should be supported if possible.

> I do have a goal: I have loads of armhf machines and only so many amd64
> machines that do not even have enough memory to properly support ceph
> and being able to do something (as the MON uses 1GB of memory alone).
> I have multiple sites with this situation, and for the foreseeable
> future, we will still be building infrastructure on armhf. Getting a
> decent AMD64 setup in any location is additional and probably
> unnecessary costs.

You'll either need to migrate to amd64 (or arm/whatever64) or pay
somebody to fix ceph at upstream.


> I think the stance of the ceph community in this is: as long as nobody
> sends in patches they are not going to care. And they can't support it
> themselves because they have a totally different target (clouds).

It's the same: they support what they get paid for or what is needed.
People rarely use 32bit these days. Even on cheap arm devices 64bit is
the way to go.



> I am willing to host the armhf releases and maybe the i386 releases on
> my server; that way there will be 32 bit releases, but not official ones.

Doesn't matter, hosting is not the issue here.

> But I do want your involvement.

You can want that, but you won't get it.
Send patches or people who will do the work.
I'll happily accept patches, or even better, bug reports with links to
patches at upstream.


> I've been trying to compile it for a while, using sources from ceph and
> from proxmox, until I realised ceph nautilus is in backports. And it
> worked.
> So at least I want your guidance on how you build these... For now I've
> used an armhf machine, and I needed to limit the number of threads to 1
> due to the C++ compiler needing more than 1GB of RAM to compile a single
> source file.

Upstream has a detailed readme, or you can use the basic way to build a
debian package using dpkg-buildpackage, or similar tools.

> Not only do I want to make support complete so I can use the hardware, I
> also think it's just bad programming not to use explicit sizes. And as I
> am also on the verge of investing in amd64 clusters, I don't want it to
> depend on code that's depending on a lot of features.
> Anyway: I don't know how you build and test on non-amd64 systems; do you
> also use armhf, or do you use a cross-compile environment?

You can just build it, if you are using the Debian source.
Otherwise you'll need a lot of patches to make it build, and even more
to fix those various 32bit related bugs.



Bernd

-- 
 Bernd Zeimetz                        Debian GNU/Linux Developer
 http://bzed.de                       http://www.debian.org
 GPG Fingerprint: ECA1 E3F2 8E11 2432 D485  DD95 EB36 171A 6FF9 435F



Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2020-06-18 Thread Ard van Breemen

Hi Bernd,

On 2020-05-27 21:22, Bernd Zeimetz wrote:
> sorry for not replying inline, but I thought I'd just share my general
> opinion on this.
>
> The biggest issue in maintaining ceph is to make it build on 32 bit
> architectures. This seems not to be supported at all by upstream
> anymore.

First of all, I don't know what your goal is to support 32 bit.
I do have a goal: I have loads of armhf machines and only so many amd64
machines that do not even have enough memory to properly support ceph
and being able to do something (as the MON uses 1GB of memory alone).
I have multiple sites with this situation, and for the foreseeable
future, we will still be building infrastructure on armhf. Getting a
decent AMD64 setup in any location is additional and probably
unnecessary costs.

> Between 14.2.7 and 14.2.9 I had a longer look into the issue and
> started to fix some issues, for example the parsing of config options
> does pretty broken things if the default for the option does not fit
> into a 32bit integer. Fixing this properly brought me to various other
> places where size_t is being used in the code, but actually an (at
> least) uint64_t is being required.
>
> Fedora already removed ceph for all 32bit architectures with a "not
> supported by upstream anymore", but I was not able to find an official
> statement from ceph upstream.

I think the stance of the ceph community in this is: as long as nobody
sends in patches they are not going to care. And they can't support it
themselves because they have a totally different target (clouds).

> Also unfortunately I did not yet find the time to collect my findings
> and send them to the ceph devel mailinglist, but I'd assume that they
> just don't want to support 32bit anymore, otherwise they'd test it
> properly.
>
> As the work to fix this properly seems to be a rather long task, I
> definitely won't do this. But I also don't want to upload maybe-working
> binaries to Debian anymore. So unless somebody fixes and tests ceph for
> 32bit (or does this for Debian, also fine for me - running the
> regression test suite is possible with enough resources and some
> hardware), I will remove all 32bit architectures with the next upload.

My debian karma is bad, really bad. That's why I asked you what your
goal is in supporting 32 bit. I have a goal. I might also be able to let
64 bit lxc containers talk to 32 bit lxc containers and real armhf
machines so I can test.
I am willing to host the armhf releases and maybe the i386 releases on
my server; that way there will be 32 bit releases, but not official ones.
But I do want your involvement.
I've been trying to compile it for a while, using sources from ceph and
from proxmox, until I realised ceph nautilus is in backports. And it
worked.
So at least I want your guidance on how you build these... For now I've
used an armhf machine, and I needed to limit the number of threads to 1
due to the C++ compiler needing more than 1GB of RAM to compile a single
source file.
Not only do I want to make support complete so I can use the hardware, I
also think it's just bad programming not to use explicit sizes. And as I
am also on the verge of investing in amd64 clusters, I don't want it to
depend on code that's depending on a lot of features.
Anyway: I don't know how you build and test on non-amd64 systems; do you
also use armhf, or do you use a cross-compile environment?


Regards,
Ard van Breemen



Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2020-05-27 Thread Bernd Zeimetz
Hi,

sorry for not replying inline, but I thought I'd just share my general
opinion on this.

The biggest issue in maintaining ceph is to make it build on 32 bit
architectures. This seems not to be supported at all by upstream anymore.

Between 14.2.7 and 14.2.9 I had a longer look into the issue and started
to fix some issues, for example the parsing of config options does
pretty broken things if the default for the option does not fit into a
32bit integer. Fixing this properly brought me to various other places
where size_t is being used in the code, but actually an (at least)
uint64_t is being required.
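
To make that concrete, here is a minimal, self-contained illustration
(hypothetical values, not the actual Ceph option-parsing code) of how a
64-bit default silently truncates when it is stored in a size_t on armhf:

  #include <cstddef>
  #include <cstdint>
  #include <iostream>

  int main() {
      // Hypothetical default larger than 4 GiB, e.g. a max object size.
      uint64_t default_value = 5ULL * 1024 * 1024 * 1024;  // 5 GiB

      // On armhf size_t is 32 bits, so the cast silently keeps only
      // the low 32 bits: 5 GiB mod 2^32 = 1 GiB.
      size_t stored = static_cast<size_t>(default_value);

      std::cout << "intended default: " << default_value << "\n"
                << "stored in size_t: " << stored << "\n";
      // amd64 prints 5368709120 twice; armhf prints 5368709120 and
      // then 1073741824 - a wrong value with no error anywhere.
      return 0;
  }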

Fedora already removed ceph for all 32bit architectures with a "not
supported by upstream anymore", but I was not able to find an official
statement from ceph upstream.

Also unfortunately I did not yet find the time to collect my findings
and send them to the ceph devel mailinglist, but I'd assume that they
just don't want to support 32bit anymore, otherwise they'd test it properly.

As the work to fix this properly seems to be a rather long task, I
definitely won't do this. But I also don't want to upload maybe-working
binaries to Debian anymore. So unless somebody fixes and tests ceph for
32bit (or does this for Debian, also fine for me - running the
regression test suite is possible with enough resources and some
hardware), I will remove all 32bit architectures with the next upload.


I guess this is not the news you wanted to hear, but so far that's the
situation.


Bernd


On 5/27/20 10:54 AM, Ard van Breemen wrote:
> Hi,
> 
> On Tue, May 26, 2020 at 06:35:20PM +0200, Val Lorentz wrote:
>> Thanks for the tip.
>>
>> I just tried downgrading an OSD (armhf) and a monitor (amd64) to
>> 14.2.7-1~bpo10+1 using http://snapshot.debian.org/ ; but they are still
>> unable to communicate ("failed decoding of frame header:
>> buffer::bad_alloc").
>>
>> So this might be a different issue, although related.
> 
> Well, 14.2.7-~bpo something did work on my armhf osd cluster,
> with 2 mons running on armhf, and one on proxmox pve 6 running
> ceph 14.2.8 .
> What already did not work was OSDs on AMD64 working together
> with a 2xarmhf and 1xamd64 mon setup.
> I had a lot of problems getting it to work at all, but I thought
> it was just my lack of knowledge at that time. 99% of the
> problems are with setting up the correct secrets, or in other
> words, the handling of the "keyrings". Even between amd64 and
> amd64 this has been buggy if I look at the release notes.
> Specifically 14.2.6 to 14.2.7 I think.
> I assume bugs are in authentication, because as long as I did not
> reboot the amd64 it works.
> The daemons authenticate using the secrets, and the secret gives
> an authentication ticket.
> 
> Anyway: the most simple test is to install a system, rsync
> /etc/ceph and type in ceph status. It either works (on 32 bits,
> fix the timeout in the python script, because if you don't it
> won't work at all) or it doesn't return at all.
> 
> I will test if it's also the case with an armhf ceph cli client to an
> amd64 cluster. I only have one working amd64 cluster though, and
> it has 2 fake OSD's, because amd64 clusters are too expensive to
> experiment with.
> I have to do some networking hacks though to connect the systems.
> 
> Anyway: the kernel has no problem talking to either OSD types, so
> the kernel's protocol handling is implemented correctly, and
> cephx works between an rbd amd64 or armhf kernel client and armhf
> userspace.
> The rbd amd64 userspace utility however does not work at all. As
> far as I can see it can't get past authentication, but without
> any logs I am a bit riddled.
> 
> By the way: the mgr dashboard module is about 99% correct. The
> disk space is obviously calculated incorrectly.
> 
> Regards,
> Ard
> 

-- 
 Bernd Zeimetz                        Debian GNU/Linux Developer
 http://bzed.de                       http://www.debian.org
 GPG Fingerprint: ECA1 E3F2 8E11 2432 D485  DD95 EB36 171A 6FF9 435F



Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2020-05-27 Thread Ard van Breemen
Hi,

On Tue, May 26, 2020 at 06:35:20PM +0200, Val Lorentz wrote:
> Thanks for the tip.
> 
> I just tried downgrading an OSD (armhf) and a monitor (amd64) to
> 14.2.7-1~bpo10+1 using http://snapshot.debian.org/ ; but they are still
> unable to communicate ("failed decoding of frame header:
> buffer::bad_alloc").
> 
> So this might be a different issue, although related.

Well, 14.2.7-~bpo something did work on my armhf osd cluster,
with 2 mons running on armhf, and one on proxmox pve 6 running
ceph 14.2.8 .
What already did not work was OSDs on AMD64 working together
with a 2xarmhf and 1xamd64 mon setup.
I had a lot of problems getting it to work at all, but I thought
it was just my lack of knowledge at that time. 99% of the
problems are with setting up the correct secrets, or in other
words, the handling of the "keyrings". Even between amd64 and
amd64 this has been buggy if I look at the release notes.
Specifically 14.2.6 to 14.2.7 I think.
I assume bugs are in authentication, because as long as I did not
reboot the amd64 it works.
The daemons authenticate using the secrets, and the secret gives
an authentication ticket.

Anyway: the most simple test is to install a system, rsync
/etc/ceph and type in ceph status. It either works (on 32 bits,
fix the timeout in the python script, because if you don't it
won't work at all) or it doesn't return at all.

I will test if it's also the case with an armhf ceph cli client to an
amd64 cluster. I only have one working amd64 cluster though, and
it has 2 fake OSD's, because amd64 clusters are too expensive to
experiment with.
I have to do some networking hacks though to connect the systems.

Anyway: the kernel has no problem talking to either OSD types, so
the kernel's protocol handling is implemented correctly, and
cephx works between an rbd amd64 or armhf kernel client and armhf
userspace.
The rbd amd64 userspace utility however does not work at all. As
far as I can see it can't get past authentication, but without
any logs I am a bit riddled.

By the way: the mgr dashboard module is about 99% correct. The
disk space is obviously calculated incorrectly.

Regards,
Ard

-- 
.signature not found



Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2020-05-26 Thread Val Lorentz
Thanks for the tip.

I just tried downgrading an OSD (armhf) and a monitor (amd64) to
14.2.7-1~bpo10+1 using http://snapshot.debian.org/ ; but they are still
unable to communicate ("failed decoding of frame header:
buffer::bad_alloc").

So this might be a different issue, although related.



Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2020-05-26 Thread Ard van Breemen

Hi Guys,

I've had working OSDs on armhf using 14.2.7, fixed using the workaround
from #956293.
The OSD and mon worked on armhf 14.2.7 and amd64 14.2.8 (a proxmox
install).
When I upgraded the 14.2.7 cluster to 14.2.9, everything still worked, 
until I rebooted the proxmox server.

Everything since then just went sour.

So: I have a complete working ceph cluster on 14.2.9 running on arm. 
ceph status works.
Mapping rbd using echo to the /sys/bus/rbd/add_single_major works (using 
the username, key and monitors from ceph.conf) on kernel 5.6.11 amd64 
and any other kernel (armhf or whatever).

So, the ceph cluster works and the protocol is still correct.

However, as soon as I try to do a ceph status on an amd64, I get an
indefinitely hanging ceph command line, with no way to trace it (please
tell me how).

This problem is limited to amd64 though.
When I install ceph on an i386 image, connecting to the ceph cluster 
works and the cluster is healthy.


So protocol-wise the amd64 kernel works with 32-bit clusters, but amd64
user space does not.
The problem might be somewhere in the authentication chain, as 14.2.9 was
working (as far as I know) until I rebooted the 64-bit system.

And I think that last CVE fix might be the problem.

Anyway, I hope this reaches someone...
Regards,
Ard van Breemen



Bug#961481: ceph: Protocol incompatibility between armhf and amd64

2020-05-24 Thread Val Lorentz
Package: ceph
Version: 14.2.9-1~bpo10+1

Dear maintainers,

I run a cluster made of armhf and amd64 OSDs, and amd64 monitors and
manager.

I recently updated my cluster from Luminous (12, in buster) to Nautilus
(14, in buster-backports), following the instructions here:
https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous

At some point (and after hot-fixing for #956293 on armhf machines), I
noticed something was off, as my OSDs kept flipping between up and down,
with all machines of one arch up and the others down.

Eventually, the armhf OSDs went definitively down (in the monitors'
view). (This might have been when I enabled msgr2, but I do not remember
the exact timing.)

Starting one of the armhf OSDs causes this kind of line to appear in the
monitors' logs:

2020-05-25 02:07:55.681 7f142df5b700 -1 --2-
[v2:[fdfc:0:0:2::e]:3300/0,v1:[fdfc:0:0:2::e]:6789/0] >>
conn(0x55f003781a80 0x55f004589b80 unknown :-1 s=HELLO_ACCEPTING pgs=0
cs=0 l=0 rx=0 tx=0).run_continuation failed decoding of frame header:
buffer::bad_alloc
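
For what it is worth, an error of this shape is what one would expect if
the two sides disagree about the width of a length field in the frame
header: the receiver parses a garbage length and the buffer allocation for
it fails. A hedged sketch of that mechanism (an assumed layout for
illustration, not Ceph's real msgr v2 wire format):

  #include <cstdint>
  #include <cstring>
  #include <iostream>
  #include <new>

  int main() {
      // The sender (say, a 32-bit build) writes a 4-byte length
      // followed by payload bytes.
      unsigned char wire[12];
      std::memset(wire, 0xAB, sizeof(wire));  // payload filler
      uint32_t sent_len = 16;
      std::memcpy(wire, &sent_len, sizeof(sent_len));

      // A 64-bit receiver that assumes an 8-byte length field pulls
      // four payload bytes into the high half of the length.
      uint64_t parsed_len;
      std::memcpy(&parsed_len, wire, sizeof(parsed_len));

      try {
          char* buf = new char[parsed_len];  // absurd size is requested
          delete[] buf;
      } catch (const std::bad_alloc&) {
          std::cout << "failed decoding of frame header: bad_alloc\n";
      }
      return 0;
  }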


Moving the disk and config from an armhf to an arm64 machine fixes the
issue.