To add to Dominique's suggestion, we use this approach a lot and have sort of standardized it internally. Of course 1NF dictates that this is not the real RDBMS way, but sometimes you need blobs because you just do.

I'm sure you've already figured out how to do it without blob collations, and that the question is aimed only at the collations rather than at alternate ideas, but on the off chance it helps - here goes:

What we do is compress the blob if needed[1], then translate it to a string using good old 3-to-4 translation (aka Base64)[2], which is rather efficient (much more so than binary --> hex, which is 1-to-2 and doubles the size). We then prepend to the resulting string "blob" a unique key for the blob, which is also what it sorts by, plus a couple more characters that record (a) whether it is compressed and (b) the blob format version, so that future software changes can still read blobs written with the current layout.
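
To make the idea concrete, here is a minimal Python sketch of that packing step - the 16-character key width, the one-character flags and the helper names are inventions for this example, not our actual format:

    import base64
    import zlib

    COMPRESS_THRESHOLD = 2560   # see [1]; tune for your own data
    BLOB_VERSION = "1"          # bump whenever the layout changes

    def pack_blob(key, payload):
        """Return a sortable text value: <key><compressed-flag><version><base64-body>."""
        compressed = len(payload) > COMPRESS_THRESHOLD
        body = zlib.compress(payload) if compressed else payload
        flag = "C" if compressed else "P"     # compressed vs. plain
        return key + flag + BLOB_VERSION + base64.b64encode(body).decode("ascii")

    def unpack_blob(value, key_len=16):
        """Split off the key/flag/version prefix and recover the original bytes."""
        key, flag = value[:key_len], value[key_len]
        body = base64.b64decode(value[key_len + 2:])
        return key, (zlib.decompress(body) if flag == "C" else body)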

From there it's an easy step to make a standardized blob-storage thingy with pack/unpack to some structure, JSON, etc.  We even have these as UDFs, though I must say that in practice we never really use the UDF versions, since the content of the blob never really comes into play in SQL - and when it does, we'd rather add real schema columns for that bit of the information.[3]
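
(If you ever do want the unpack step visible to SQL, most bindings let you register it as a function. A hypothetical sketch in Python's sqlite3, reusing unpack_blob() from the example above, with table/column names of my own choosing:)

    import sqlite3

    conn = sqlite3.connect("example.db")
    # Expose the decoded payload length to SQL without adding a schema column,
    # e.g.  SELECT id, blob_size(payload) FROM docs;
    conn.create_function("blob_size", 1,
                         lambda v: None if v is None else len(unpack_blob(v)[1]))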

If sorting is all you need from this, it's also easy to simply add another column to the table that contains the sort key and use that column whenever you sort or compare in a query.
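
A quick sketch of that, continuing the example above with pack_blob() and the same made-up names:

    import sqlite3

    conn = sqlite3.connect("example.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS docs (
            id       INTEGER PRIMARY KEY,
            sort_key TEXT,   -- extracted once, at insert time
            payload  TEXT    -- the packed "blob" string described above
        );
        CREATE INDEX IF NOT EXISTS docs_by_key ON docs(sort_key);
    """)
    key = "0000000000000001"
    conn.execute("INSERT INTO docs (sort_key, payload) VALUES (?, ?)",
                 (key, pack_blob(key, b"some payload")))
    # Sorting now uses the plain, indexable column, never the packed blob:
    rows = conn.execute("SELECT id FROM docs ORDER BY sort_key").fetchall()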

Lastly, you must already be aware, but just in case someone else is reading this: If you do have more than one column, place the Blob column at the end of the table definition.


Cheers,
Ryan

[1]: We start compressing at Len > 2560 bytes - we arrived at that number through some research of our own, and a different figure might work for you. The compression algorithm we pick varies between maximizing size saving and maximizing speed. We can see a size saving well before 2560 bytes, but for the general case the total time of Base64.decode(compressed data) + LZ.decompress() only starts beating Base64.decode(non-compressed data) sporadically after about 3 thousand bytes. So even though a size saving starts under 1K, a speed saving only starts around 3K+. Also, in case anyone wonders: Base64 output compresses really badly - there is no point in doing Base64 first and then compressing, it's always a loss. Mind you, this was measured long ago; perhaps I will find the code, modernize it and run the tests again, as some things may have changed.
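
If anyone wants to redo that timing comparison on current hardware, a rough sketch of the measurement (sample data and iteration counts are arbitrary, and real payloads will compress differently):

    import base64, timeit, zlib

    payload = bytes(range(256)) * 12   # ~3 KB of sample data
    plain = base64.b64encode(payload)
    packed = base64.b64encode(zlib.compress(payload))

    t_plain = timeit.timeit(lambda: base64.b64decode(plain), number=10000)
    t_packed = timeit.timeit(
        lambda: zlib.decompress(base64.b64decode(packed)), number=10000)
    print(f"plain {t_plain:.3f}s, compressed {t_packed:.3f}s, "
          f"stored sizes {len(plain)} vs {len(packed)} bytes")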

[2]: As an aside, since the only real problem is the null character in strings, we designed another Base64-like translation that, instead of 8-bit to 6-bit (3x8 bits --> 4x6 bits), does 8-bit to 7-bit (7x8 bits --> 8x7 bits) using a full 128-character palette plus a few control characters. That is not hard to find in a 256-character ASCII range (well, about 220-ish usable characters) and gives a great 7/8 size ratio (as opposed to 3/4), but it breaks UTF-8 encoding (because of bit 7), so you must never load the stored strings through a UTF-8 translation. In fact, after some work on this, we decided that the size saving from 3/4 to 7/8 (or put another way: 25% waste down to 12.5% waste) is just not worth the complication and the non-standard translation. Perhaps one day when we have blobs of 50 megabytes+, but at that point binding them as actual BLOBs and adding a couple of other control columns would be the better answer.
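
For the curious, a toy version of that 8-bit-to-7-bit idea - it simply maps every 7-bit value v to the byte v+1 so that NUL never appears; this is a simplification of mine, not the translation we actually use:

    def pack_7to8(data):
        # Every 7 input bytes become 8 output bytes carrying 7 bits each.
        # The first output byte records the padding count (offset by 1, never NUL).
        pad = (-len(data)) % 7
        padded = data + b"\x00" * pad
        out = bytearray([pad + 1])
        for i in range(0, len(padded), 7):
            chunk = int.from_bytes(padded[i:i + 7], "big")   # 56 bits
            for shift in range(49, -1, -7):                  # eight 7-bit slices
                out.append(((chunk >> shift) & 0x7F) + 1)    # values 1..128
        return bytes(out)

    def unpack_8to7(encoded):
        pad = encoded[0] - 1
        out = bytearray()
        for i in range(1, len(encoded), 8):
            chunk = 0
            for b in encoded[i:i + 8]:
                chunk = (chunk << 7) | (b - 1)
            out += chunk.to_bytes(7, "big")
        return bytes(out[:-pad]) if pad else bytes(out)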

[3]: String storage has a hard limit of ~2 GB in some systems, though you should have opted for another form of storage long before you reach that limit.


On 2019/04/27 12:35 PM, Dominique Devienne wrote:
On Fri, Apr 26, 2019 at 7:36 PM Jens Alfke <j...@mooseyard.com> wrote:

We are using SQLite blobs to store some structured values, and need
control over how they are collated in queries, i.e. memcmp is not the
correct ordering. We’ve registered a custom collating function, but
unfortunately it doesn’t get called. According to the docs, collating
functions are only used to compare strings, not any other data type. Is
there any way around this limitation?

The only workaround I can think of is to define a custom function that
converts a blob to a string and collate using those strings — e.g. `… ORDER
BY collatable_blob(b)`. But this requires expensive string conversions, and
it doesn’t work well with indexes.

It would be much cleaner and more efficient if there were a type of
collating function that operated on all data types.

You are hitting what is IMHO one of the last big remaining "holes" in
SQLite, which is its lack of User-Defined Types (UDTs) for persistent
columns.
(For transient values in function call chains, we have subtypes nowadays,
which is as close to a UDT as SQLite has.)

Your blob obviously has internal structure, and while you're happy to store
it as an opaque blob, you'd like to have it sort differently.
You can of course work around not having a real UDT by assigning a custom
collation to the column, but really it belongs on the "type", not the
column.

Coming back to your more specific question: aside from DRH adding support
for User-Defined Collations on blob columns (a long shot, I suspect),
you could store your "structured values" as strings instead of blobs.
AFAIK, SQLite does not check that the string is valid UTF-8, so it can store
arbitrary binary too.
As long as you do not perform UTF-8 <-> UTF-16 encoding conversions, and you
do not call SQL functions that assume valid UTF-8, you should be fine.
And now you can have your custom collation.
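
A minimal sketch of that in Python's sqlite3, assuming the values carry a sortable fixed-width key prefix as in the scheme described above (names are mine; note that Python hands the collation callback str objects, so from Python you would want the stored text to remain valid UTF-8 - the C API has no such concern):

    import sqlite3

    def blob_key_collation(a, b):
        # Compare only the 16-character key prefix of the stored string.
        ka, kb = a[:16], b[:16]
        return (ka > kb) - (ka < kb)

    conn = sqlite3.connect("example.db")
    conn.create_collation("blob_key", blob_key_collation)
    rows = conn.execute(
        "SELECT id FROM docs ORDER BY payload COLLATE blob_key").fetchall()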

Not friendly to users of the DB who might see garbage when printing those
"binary" strings, though.
But at this point, that's the only thing you can do that avoids the
expensive conversions you pointed out. --DD