SQLite has something of a normative declaration in its source code:

/*
** This is the maximum number of
**
**    * Columns in a table
**    * Columns in an index
**    * Columns in a view
**    * Terms in the SET clause of an UPDATE statement
**    * Terms in the result set of a SELECT statement
**    * Terms in the GROUP BY or ORDER BY clauses of a SELECT statement.
**    * Terms in the VALUES clause of an INSERT statement
**
** The hard upper limit here is 32676.  Most database people will
** tell you that in a well-normalized database, you usually should
** not have more than a dozen or so columns in any table.  And if
** that is the case, there is no point in having more than a few
** dozen values in any of the other situations described above.
*/
#ifndef SQLITE_MAX_COLUMN
# define SQLITE_MAX_COLUMN 2000
#endif
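
(I realize the #ifndef guard above means this ceiling can already be
raised at build time -- the documentation allows values as large as
32767 -- with something like

cc -DSQLITE_MAX_COLUMN=8192 -c sqlite3.c

where 8192 is just an example value.  But that route means shipping a
custom build, which is exactly what I'd like to avoid; more on that
below.)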

All software has constraints fundamental to its problem domain and
particular implementation, and I would not mail you simply for making
an architectural decision on behalf of your users.  You do that every
day.

However, conditions have changed since (I expect) this design was
specified.  One of the more useful and usable packages for Natural Language
Processing, Magnitude[1], leverages SQLite to efficiently handle the
real-valued but entirely abstract collections of numbers -- vector spaces --
that modern machine learning depends on.

Traditionally, word2vec and other language vector approaches have fit
their models into tens or hundreds of dimensions.  Newer work captures
the context of each word -- "Huntington" preceded by "Mr." vs.
"Huntington" preceded by "Street" or followed by "Disease".  These
phrase or sentence embeddings (as they're called) can take quite a bit
more space to represent the necessary context.  They may need 3072
columns (in the case of Magnitude's ELMo records[2]) or even 4096
columns (in the case of InferSent's sentence embeddings[3]).

That exceeds SQLite's default column limit.
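
To make that concrete, here is a minimal C sketch (the table and
column names are invented for illustration) that lays out a
4096-dimension embedding as one REAL column per dimension, the way a
wide-table storage scheme would, and fails against a stock build:

#include <sqlite3.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void){
  sqlite3 *db;
  char *err = 0;
  const int dims = 4096;              /* InferSent-sized embedding */
  char *sql = malloc(dims*16 + 64);
  char col[32];
  int i;

  sqlite3_open(":memory:", &db);

  /* One REAL column per vector dimension, plus a key. */
  strcpy(sql, "CREATE TABLE emb (key TEXT");
  for(i=0; i<dims; i++){
    snprintf(col, sizeof(col), ", d%d REAL", i);
    strcat(sql, col);
  }
  strcat(sql, ")");

  if( sqlite3_exec(db, sql, 0, 0, &err)!=SQLITE_OK ){
    printf("%s\n", err);              /* "too many columns on emb" */
    sqlite3_free(err);
  }
  free(sql);
  sqlite3_close(db);
  return 0;
}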

These spaces, though abstract, have become the most powerful way we
know not merely to represent relationships but to discover them.  This
is simply
a new data domain that is the *input* to what eventually becomes the
familiarly normalizable relational domain.  It's different, but it's not
"wrong".  If it's behavior you can support -- as your 32K hard limit
implies -- it would certainly be helpful for these user scenarios.

I spent quite a bit of time hacking large-column support into a
working Python pipeline, and I'd prefer never to run that in
production.  Converting this compile-time variable into a runtime knob
would be appreciated.
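
For what it's worth, half of that knob already exists: sqlite3_limit()
has a SQLITE_LIMIT_COLUMN category, but it can only lower the limit;
anything above the compile-time SQLITE_MAX_COLUMN is silently clamped
back down.  A quick sketch against a stock build:

#include <sqlite3.h>
#include <stdio.h>

int main(void){
  sqlite3 *db;
  sqlite3_open(":memory:", &db);

  /* A negative value queries the current limit without changing it. */
  printf("before: %d\n", sqlite3_limit(db, SQLITE_LIMIT_COLUMN, -1));

  /* Requests above the compile-time ceiling are silently truncated,
  ** so on a stock build this is a no-op and both lines print 2000. */
  sqlite3_limit(db, SQLITE_LIMIT_COLUMN, 4096);
  printf("after:  %d\n", sqlite3_limit(db, SQLITE_LIMIT_COLUMN, -1));

  sqlite3_close(db);
  return 0;
}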

--Dan

[1] https://github.com/plasticityai/magnitude
[2] https://github.com/plasticityai/magnitude/blob/master/ELMo.md
[3] https://github.com/facebookresearch/InferSent