I’m experimenting with querying databases of JSON documents. These data-sets 
are schemaless and there’s no guarantee that they all have a common set of 
properties; in fact it’s common for them to have the equivalent of multiple 
‘tables’ in the same data-set, i.e. groups of documents with distinct sets of 
common properties.

Unindexed queries just using SELECT with json_extract() are faster than I 
expected, and of course they become blazing fast when I create indexes on those 
json_extract() calls. Great stuff!

Here’s my issue/question: It’s common for a JSON property to exist in only a 
subset of the rows/documents. This seems like a perfect use case for partial 
indexes … except that the WHERE clause in a CREATE INDEX statement explicitly 
disallows function calls, so I can’t constrain the index to only the rows that 
contain the JSON property. Is this limitation something that might be lifted 
soon (now that functions can be identified as ‘pure’), or is it somehow a 
direct consequence of the way partial indexes work?

To give an example: Let’s say the data-set contains JSON documents describing 
both students and courses. There are 20,000 students but only 200 courses being 
taught. Courses have a “professor” property. I would like to query by 
professor, but only 1% of the rows contain that property, so unless I constrain 
the index it’s going to be 99% full of useless null values. I’d like to say:
        CREATE INDEX profs ON dataset (json_extract(doc, ‘$.professor’)) WHERE 
json_extract(doc, ‘$.professor’) IS NOT NULL
but SQLite rejects this because of the function call in the WHERE clause.

(Now, maybe I’m prematurely optimizing, and in reality all those null values 
don’t have much overhead, or are suppressed entirely; I don’t know.)

I can think of some workarounds for this, but they require changing the SQL 
schema, for instance adding a boolean ‘has_professor’ column that's accessible 
when indexing. This can be awkward because I’m writing general-purpose code 
that doesn’t know ahead of time what sort of JSON data-sets might be loaded 
into it, or what queries might be run. So it would have to alter the tables on 
the fly as the caller makes queries or requests indexing.

Any advice gratefully appreciated. (I’ve looked through the list archives to 
see if this has come up before, but I couldn’t find any threads.)

—Jens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to