To add to Dominique's suggestion, we use this approach a lot and have sort of standardized it internally. Of course 1NF dictates that this is not the real RDBMS way, but sometimes you need blobs because you just do.

I'm sure you've already figured out how to do it without blob collations, and that the question is aimed only at the collations rather than at alternate ideas, but on the off chance it helps - here goes:

What we do is compress the blob if needed[1], then translate it to a string using good old 3-to-4 translation (aka Base64)[2], which is rather efficient (much more so than binary --> hex, which is 1-to-2 and doubles the size). We then prepend to the resulting string "blob" a unique key for the blob, which is also what it sorts by, plus a couple more characters that record (a) whether it is compressed and (b) the blob format version, so that future software changes can still read blobs written with the current layout.
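
To make the idea concrete, here is a minimal Python sketch of that packing step - the 16-character key width, the one-character flags and the helper names are inventions for this example, not our actual format:

    import base64
    import zlib

    COMPRESS_THRESHOLD = 2560   # see [1]; tune for your own data
    BLOB_VERSION = "1"          # bump whenever the layout changes

    def pack_blob(key, payload):
        """Return a sortable text value: <key><compressed-flag><version><base64-body>."""
        compressed = len(payload) > COMPRESS_THRESHOLD
        body = zlib.compress(payload) if compressed else payload
        flag = "C" if compressed else "P"     # compressed vs. plain
        return key + flag + BLOB_VERSION + base64.b64encode(body).decode("ascii")

    def unpack_blob(value, key_len=16):
        """Split off the key/flag/version prefix and recover the original bytes."""
        key, flag = value[:key_len], value[key_len]
        body = base64.b64decode(value[key_len + 2:])
        return key, (zlib.decompress(body) if flag == "C" else body)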

From there it's an easy step to make a standardized blob-storage thingy with pack/unpack to some structure, JSON, etc.  We even have these as UDFs, though I must say that in practice we never really use the UDF versions, since the content of the blob never really comes into play in SQL - and when it does, we'd rather add real schema columns for that bit of the information.[3]
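
(If you ever do want the unpack step visible to SQL, most bindings let you register it as a function. A hypothetical sketch in Python's sqlite3, reusing unpack_blob() from the example above, with table/column names of my own choosing:)

    import sqlite3

    conn = sqlite3.connect("example.db")
    # Expose the decoded payload length to SQL without adding a schema column,
    # e.g.  SELECT id, blob_size(payload) FROM docs;
    conn.create_function("blob_size", 1,
                         lambda v: None if v is None else len(unpack_blob(v)[1]))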

If sorting is all you need from this, it's also easy to simply add another column to the table that contains the sort key and use that column whenever you sort or compare in a query.
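
A quick sketch of that, continuing the example above with pack_blob() and the same made-up names:

    import sqlite3

    conn = sqlite3.connect("example.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS docs (
            id       INTEGER PRIMARY KEY,
            sort_key TEXT,   -- extracted once, at insert time
            payload  TEXT    -- the packed "blob" string described above
        );
        CREATE INDEX IF NOT EXISTS docs_by_key ON docs(sort_key);
    """)
    key = "0000000000000001"
    conn.execute("INSERT INTO docs (sort_key, payload) VALUES (?, ?)",
                 (key, pack_blob(key, b"some payload")))
    # Sorting now uses the plain, indexable column, never the packed blob:
    rows = conn.execute("SELECT id FROM docs ORDER BY sort_key").fetchall()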

Lastly, you must already be aware, but just in case someone else is reading this: If you do have more than one column, place the Blob column at the end of the table definition.


Cheers,
Ryan

[1]: We start compressing at Len > 2560 bytes - we arrived at that number through some research of our own, and a different figure might work for you. The compression algorithm we pick varies between maximizing size saving and maximizing speed. We can see a size saving well before 2560 bytes, but for the general case the total time of Base64.decode(compressed data) + LZ.decompress() only starts beating Base64.decode(non-compressed data) sporadically after about 3 thousand bytes. So even though a size saving starts under 1K, a speed saving only starts around 3K+. Also, in case anyone wonders: Base64 output compresses really badly - there is no point in doing Base64 first and then compressing, it's always a loss. Mind you, this was measured long ago; perhaps I will find the code, modernize it and run the tests again, as some things may have changed.
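
If anyone wants to redo that timing comparison on current hardware, a rough sketch of the measurement (sample data and iteration counts are arbitrary, and real payloads will compress differently):

    import base64, timeit, zlib

    payload = bytes(range(256)) * 12   # ~3 KB of sample data
    plain = base64.b64encode(payload)
    packed = base64.b64encode(zlib.compress(payload))

    t_plain = timeit.timeit(lambda: base64.b64decode(plain), number=10000)
    t_packed = timeit.timeit(
        lambda: zlib.decompress(base64.b64decode(packed)), number=10000)
    print(f"plain {t_plain:.3f}s, compressed {t_packed:.3f}s, "
          f"stored sizes {len(plain)} vs {len(packed)} bytes")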

[2]: As an aside, since the only real problem is the null character in strings, we designed another Base64-like translation that, instead of 8-bit to 6-bit (3x8 bits --> 4x6 bits), does 8-bit to 7-bit (7x8 bits --> 8x7 bits) using a full 128-character palette plus a few control characters. That is not hard to find in a 256-character ASCII range (well, about 220-ish usable characters) and gives a great 7/8 size ratio (as opposed to 3/4), but it breaks UTF-8 encoding (because of bit 7), so you must never load the stored strings through a UTF-8 translation. In fact, after some work on this, we decided that the size saving from 3/4 to 7/8 (or put another way: 25% waste down to 12.5% waste) is just not worth the complication and the non-standard translation. Perhaps one day when we have blobs of 50 megabytes+, but at that point binding them as actual BLOBs and adding a couple of other control columns would be the better answer.
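
For the curious, a toy version of that 8-bit-to-7-bit idea - it simply maps every 7-bit value v to the byte v+1 so that NUL never appears; this is a simplification of mine, not the translation we actually use:

    def pack_7to8(data):
        # Every 7 input bytes become 8 output bytes carrying 7 bits each.
        # The first output byte records the padding count (offset by 1, never NUL).
        pad = (-len(data)) % 7
        padded = data + b"\x00" * pad
        out = bytearray([pad + 1])
        for i in range(0, len(padded), 7):
            chunk = int.from_bytes(padded[i:i + 7], "big")   # 56 bits
            for shift in range(49, -1, -7):                  # eight 7-bit slices
                out.append(((chunk >> shift) & 0x7F) + 1)    # values 1..128
        return bytes(out)

    def unpack_8to7(encoded):
        pad = encoded[0] - 1
        out = bytearray()
        for i in range(1, len(encoded), 8):
            chunk = 0
            for b in encoded[i:i + 8]:
                chunk = (chunk << 7) | (b - 1)
            out += chunk.to_bytes(7, "big")
        return bytes(out[:-pad]) if pad else bytes(out)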

[3]: String storage has a hard limit of ~2 GB in some systems, though you should have opted for another form of storage long before you reach that limit.


On 2019/04/27 12:35 PM, Dominique Devienne wrote:
On Fri, Apr 26, 2019 at 7:36 PM Jens Alfke <j...@mooseyard.com> wrote:

We are using SQLite blobs to store some structured values, and need
control over how they are collated in queries, i.e. memcmp is not the
correct ordering. We’ve registered a custom collating function, but
unfortunately it doesn’t get called. According to the docs, collating
functions are only used to compare strings, not any other data type. Is
there any way around this limitation?

The only workaround I can think of is to define a custom function that
converts a blob to a string and collate using those strings — e.g. `… ORDER
BY collatable_blob(b)`. But this requires expensive string conversions, and
it doesn’t work well with indexes.

It would be much cleaner and more efficient if there were a type of
collating function that operated on all data types.

You are hitting what is IMHO one of the last big remaining "holes" in
SQLite, which is its lack of User-Defined Types (UDTs) for persistent
columns.
(For transient values in function call chains, we have subtypes nowadays,
which is as close to a UDT as SQLite has.)

Your blob obviously has internal structure, and while you're happy to store
it as an opaque blob, you'd like to have it sort differently.
You can of course work around not having a real UDT by assigning a custom
collation to the column, but really it belongs on the "type", not the
column.

Coming back to your more specific question: aside from DRH adding support
for User-Defined Collations on blob columns (a long shot, I suspect),
you could store your "structured values" as strings instead of blobs.
AFAIK, SQLite does not check that the string is valid UTF-8, so it can store
arbitrary binary too.
As long as you do not perform UTF-8 <-> UTF-16 encoding conversions, and you
do not call SQL functions that assume valid UTF-8, you should be fine.
And now you can have your custom collation.
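
A minimal sketch of that in Python's sqlite3, assuming the values carry a sortable fixed-width key prefix as in the scheme described above (names are mine; note that Python hands the collation callback str objects, so from Python you would want the stored text to remain valid UTF-8 - the C API has no such concern):

    import sqlite3

    def blob_key_collation(a, b):
        # Compare only the 16-character key prefix of the stored string.
        ka, kb = a[:16], b[:16]
        return (ka > kb) - (ka < kb)

    conn = sqlite3.connect("example.db")
    conn.create_collation("blob_key", blob_key_collation)
    rows = conn.execute(
        "SELECT id FROM docs ORDER BY payload COLLATE blob_key").fetchall()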

Not friendly to users of the DB who might see garbage when printing those
"binary" strings, though.
But at this point, that's the only thing you can do that avoids the
expensive conversions you pointed out. --DD