Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-21 Thread Igor Fedotov


On 1/18/2019 6:33 PM, KEVIN MICHAEL HRPCEK wrote:



On 1/18/19 7:26 AM, Igor Fedotov wrote:


Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:

Hey,

I recall reading about this somewhere but I can't find it in the
docs or list archive, and confirmation from a dev or someone who
knows for sure would be nice. What I recall is that bluestore has a
max 4GB file size limit based on the design of bluestore, not the
osd_max_object_size setting. The bluestore source seems to suggest
that by setting OBJECT_MAX_SIZE to a 32bit max, giving an error
if osd_max_object_size is >= OBJECT_MAX_SIZE, and not writing the
data if offset+length >= OBJECT_MAX_SIZE. So it seems like the
in-OSD file size int can't exceed 32 bits, which is 4GB, like FAT32.
Am I correct, or maybe I'm reading all this wrong?


You're correct, BlueStore doesn't support objects larger than
OBJECT_MAX_SIZE (i.e. 4 GB).



Thanks for confirming that!





If bluestore has a hard 4GB object limit, using radosstriper to break
up an object would work, but does using an EC pool that breaks up
the object into shards smaller than OBJECT_MAX_SIZE have the same
effect as radosstriper to get around a 4GB limit? We use rados
directly and would like to move to bluestore, but we have some large
objects <= 13G that may need attention if this 4GB limit does exist
and an EC pool doesn't get around it.

Theoretically, splitting the object using EC might help. But I'm not
sure whether one needs to raise osd_max_object_size above 4 GB to
permit 13 GB objects in an EC pool. If that is needed, then the
osd_max_object_size <= OBJECT_MAX_SIZE constraint is violated and
BlueStore won't start.

In my experience I had to increase osd_max_object_size from the 128M
default (which it changed to a couple of versions ago) to ~20G to be
able to write our largest objects with some margin. Do you think there
is another way to handle osd_max_object_size > OBJECT_MAX_SIZE so that
bluestore will start and EC pools or striping can be used to write
objects that are larger than OBJECT_MAX_SIZE, where each stripe/shard
ends up smaller than OBJECT_MAX_SIZE after striping or being placed in
an EC pool?
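
To make the shard-size question concrete, here is a rough sketch of the arithmetic. It is not Ceph code. It assumes OBJECT_MAX_SIZE is the 32-bit maximum 0xffffffff, that an EC pool with k data chunks stores roughly object_size / k bytes per data shard, and it ignores stripe-unit padding. The names kObjectMaxSize, approx_shard_size and min_data_chunks are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value: the 32-bit maximum suggested by the "// 32 bits" comment.
constexpr uint64_t kObjectMaxSize = 0xffffffffULL;

// Approximate bytes stored per data shard when an object is split evenly
// across k data chunks. Stripe-unit padding is ignored (simplification).
uint64_t approx_shard_size(uint64_t object_size, unsigned k) {
  assert(k > 0);
  return (object_size + k - 1) / k;  // ceiling division
}

// Smallest k for which every data shard stays below the limit, mirroring
// the "offset + length >= OBJECT_MAX_SIZE" rejection quoted in this thread.
unsigned min_data_chunks(uint64_t object_size) {
  unsigned k = 1;
  while (approx_shard_size(object_size, k) >= kObjectMaxSize) {
    ++k;
  }
  return k;
}
```

Under these assumptions a 13 GiB object needs k >= 4 for each data shard to stay below the limit. Whether osd_max_object_size still has to be raised past 4 GB to write such an object is the open question here.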


I'm not very familiar with the logic that osd_max_object_size controls
at the OSD level. But IMO there might be two logically valid options:

1) This is the maximum user (RADOS?) object size. In this case the
verification in BlueStore is a bit incorrect, as EC might be in the
path and hence one can still have a 4+ GB object stored. If that's the
case, then it's enough to remove the corresponding assertion in
BlueStore.

2) This is the maximum object size provided to the object store. Then
one should be able to upload an object larger than this threshold
using EC.

I'm going to verify this behavior and come up with corresponding fixes,
if any, shortly.

Unfortunately, in the short term I don't see any workaround for your
case other than a custom build that has the assertion in BlueStore
removed.







https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex
         << OBJECT_MAX_SIZE
         << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "."
         << std::dec << dendl;
    return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }

Thanks!
Kevin
--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Thanks,

Igor





Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread KEVIN MICHAEL HRPCEK


On 1/18/19 7:26 AM, Igor Fedotov wrote:

Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:
Hey,

I recall reading about this somewhere but I can't find it in the docs or list
archive, and confirmation from a dev or someone who knows for sure would be
nice. What I recall is that bluestore has a max 4GB file size limit based on
the design of bluestore, not the osd_max_object_size setting. The bluestore
source seems to suggest that by setting OBJECT_MAX_SIZE to a 32bit max,
giving an error if osd_max_object_size is >= OBJECT_MAX_SIZE, and not writing
the data if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD file
size int can't exceed 32 bits, which is 4GB, like FAT32. Am I correct, or maybe
I'm reading all this wrong?

You're correct, BlueStore doesn't support objects larger than
OBJECT_MAX_SIZE (i.e. 4 GB).

Thanks for confirming that!


If bluestore has a hard 4GB object limit, using radosstriper to break up an
object would work, but does using an EC pool that breaks up the object into
shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper to get
around a 4GB limit? We use rados directly and would like to move to bluestore,
but we have some large objects <= 13G that may need attention if this 4GB limit
does exist and an EC pool doesn't get around it.

Theoretically, splitting the object using EC might help. But I'm not sure
whether one needs to raise osd_max_object_size above 4 GB to permit 13 GB
objects in an EC pool. If that is needed, then the osd_max_object_size <=
OBJECT_MAX_SIZE constraint is violated and BlueStore won't start.

In my experience I had to increase osd_max_object_size from the 128M default
(which it changed to a couple of versions ago) to ~20G to be able to write our
largest objects with some margin. Do you think there is another way to handle
osd_max_object_size > OBJECT_MAX_SIZE so that bluestore will start and EC pools
or striping can be used to write objects that are larger than OBJECT_MAX_SIZE,
where each stripe/shard ends up smaller than OBJECT_MAX_SIZE after striping or
being placed in an EC pool?



https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex
         << OBJECT_MAX_SIZE
         << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "."
         << std::dec << dendl;
    return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }

Thanks!
Kevin


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison





Thanks,

Igor



Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread Igor Fedotov

Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:

Hey,

I recall reading about this somewhere but I can't find it in the docs
or list archive, and confirmation from a dev or someone who knows for
sure would be nice. What I recall is that bluestore has a max 4GB file
size limit based on the design of bluestore, not the
osd_max_object_size setting. The bluestore source seems to suggest
that by setting OBJECT_MAX_SIZE to a 32bit max, giving an error if
osd_max_object_size is >= OBJECT_MAX_SIZE, and not writing the data if
offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD file
size int can't exceed 32 bits, which is 4GB, like FAT32. Am I correct,
or maybe I'm reading all this wrong?


You're correct, BlueStore doesn't support objects larger than
OBJECT_MAX_SIZE (i.e. 4 GB).





If bluestore has a hard 4GB object limit, using radosstriper to break
up an object would work, but does using an EC pool that breaks up the
object into shards smaller than OBJECT_MAX_SIZE have the same effect as
radosstriper to get around a 4GB limit? We use rados directly and
would like to move to bluestore, but we have some large objects <= 13G
that may need attention if this 4GB limit does exist and an EC pool
doesn't get around it.

Theoretically, splitting the object using EC might help. But I'm not
sure whether one needs to raise osd_max_object_size above 4 GB to
permit 13 GB objects in an EC pool. If that is needed, then the
osd_max_object_size <= OBJECT_MAX_SIZE constraint is violated and
BlueStore won't start.



https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex
         << OBJECT_MAX_SIZE
         << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "."
         << std::dec << dendl;
    return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }

Thanks!
Kevin
--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison



Thanks,

Igor



[ceph-users] Bluestore 32bit max_object_size limit

2019-01-17 Thread KEVIN MICHAEL HRPCEK
Hey,

I recall reading about this somewhere but I can't find it in the docs or list
archive, and confirmation from a dev or someone who knows for sure would be
nice. What I recall is that bluestore has a max 4GB file size limit based on
the design of bluestore, not the osd_max_object_size setting. The bluestore
source seems to suggest that by setting OBJECT_MAX_SIZE to a 32bit max,
giving an error if osd_max_object_size is >= OBJECT_MAX_SIZE, and not writing
the data if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD file
size int can't exceed 32 bits, which is 4GB, like FAT32. Am I correct, or maybe
I'm reading all this wrong?

If bluestore has a hard 4GB object limit, using radosstriper to break up an
object would work, but does using an EC pool that breaks up the object into
shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper to get
around a 4GB limit? We use rados directly and would like to move to bluestore,
but we have some large objects <= 13G that may need attention if this 4GB limit
does exist and an EC pool doesn't get around it.
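
As an illustration of the striping idea only, the sketch below splits a large logical object into fixed-size pieces so that each stored piece stays far below 4GB. This is not the libradosstriper API and does not reproduce its layout rules. The Extent struct, split_into_pieces and the piece size are hypothetical names and values chosen for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One piece of a larger logical object: where it starts and how long it is.
struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Split total_size bytes into consecutive pieces of at most piece_size
// bytes. The last piece may be shorter. Returns an empty list when
// piece_size is zero.
std::vector<Extent> split_into_pieces(uint64_t total_size,
                                      uint64_t piece_size) {
  std::vector<Extent> pieces;
  if (piece_size == 0) {
    return pieces;
  }
  for (uint64_t off = 0; off < total_size; off += piece_size) {
    uint64_t len = std::min(piece_size, total_size - off);
    pieces.push_back({off, len});
  }
  return pieces;
}
```

For example, split_into_pieces(13ULL << 30, 1ULL << 30) yields 13 pieces of 1 GiB each, every one of which would be stored as its own, much smaller object.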


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex
         << OBJECT_MAX_SIZE
         << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "."
         << std::dec << dendl;
    return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }
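
The two checks quoted above can be tried in isolation with a small standalone sketch. It is not the BlueStore code itself: it assumes OBJECT_MAX_SIZE is 0xffffffff (the 32-bit maximum), and the function names check_max_object_size_setting and check_write_range are invented for the example.

```cpp
#include <cerrno>
#include <cstdint>

// Assumed value: the 32-bit maximum suggested by the "// 32 bits" comment.
constexpr uint64_t kObjectMaxSize = 0xffffffffULL;

// Mirrors the startup sanity check: a configured osd_max_object_size at or
// above the limit is rejected with -EINVAL.
int check_max_object_size_setting(uint64_t osd_max_object_size) {
  return osd_max_object_size >= kObjectMaxSize ? -EINVAL : 0;
}

// Mirrors the write-path check: a write whose end offset reaches the limit
// is rejected with -E2BIG before any data is written.
int check_write_range(uint64_t offset, uint64_t length) {
  return offset + length >= kObjectMaxSize ? -E2BIG : 0;
}
```

Both comparisons are >=, so in this sketch a setting exactly equal to the limit is rejected, and so is a write that ends exactly at 0xffffffff. A ~20G osd_max_object_size fails the first check.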

Thanks!
Kevin


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison