Hi!

This is getting a bit off topic.

On Thu, Oct 17, 2019 at 12:07 PM Simon Slavin <slav...@bigfraud.org> wrote:
> 1) Almost no piece of software can handle a grid 2 billion cells wide.  Excel 
> maxes out at 16,384 columns.  Matlab can store and retrieve a cell of data 
> directly from a file, but it has a max array size of 10000.  R maxes out at 
> 2147483647, which is more than 2 billion.  But R has to hold all the data 
> from a matrix in memory at once and it can't assign enough memory to one 
> object to hold that many cells.

Of course, 2 billion is a lot. But 100k columns is something many ML
libraries support: Pandas, NumPy's ndarray, R. Nothing too magical about that.
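For what it's worth, a quick sanity check with NumPy (a minimal sketch; the shape here is made up, not our actual data size):

```python
import numpy as np

# A 100,000-column array is unremarkable for NumPy: it is just a
# contiguous block of memory, ~800 KB per row of float64 values.
wide = np.zeros((10, 100_000))

print(wide.shape)   # (10, 100000)
print(wide.nbytes)  # 8000000 bytes total, i.e. ~8 MB for 10 rows
```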

> 2) Object names are not data.  They're descriptions in your favourite human 
> language.  They're not meant to have weird sequences of characters in.

Not sure what this relates to.

> 3) Lots of CSV import filters ignore a column header row, or can only create 
> fieldnames with certain limits (max length, no punctuation characters, etc.). 
>  So you should expect to lose fieldnames if you try to import your data into 
> some new piece of software.

Does SQLite have limitations on what can be a column name? If not,
then I would not worry about what some CSV importers do. We would use
a good one to convert to SQLite.
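As far as I can tell, SQLite accepts essentially arbitrary column names as long as they are double-quoted. A minimal sketch with Python's stdlib `sqlite3` (the gene-style identifier is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Double-quoted identifiers may contain spaces, punctuation, mixed
# case, etc. "GENE-1A/2" is a hypothetical gene name, not real data.
con.execute('CREATE TABLE expr ("sample id" TEXT, "GENE-1A/2" REAL)')
con.execute("INSERT INTO expr VALUES (?, ?)", ("s1", 0.42))

row = con.execute('SELECT "GENE-1A/2" FROM expr').fetchone()
print(row)  # (0.42,)
```

Note that the number of columns is still bounded (SQLITE_MAX_COLUMN, a compile-time limit), which matters separately from what the names themselves may contain.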

> (4) SQLite stores all the data for a row is together, in a sequence.  If you 
> ask for the data in the 3756th column of a row, SQLite has to read and parse 
> the data for the first 3755 columns of that row, just to read a single value 
> from storage.  As you can imagine, this is slow and involves a lot of I/O.  
> And while it happens the row up to that point must all be held in memory.  
> Consequently, nobody who uses SQLite for its intended purpose actually does 
> this.  I dread to think how slow random access over 2 billion columns would 
> be in SQLite.

I wrote earlier that for us the most common use case is reading whole
rows.
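That is, the access pattern looks roughly like this (a sketch with a tiny made-up table, not our schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a REAL, b REAL, c REAL)")
con.execute("INSERT INTO t VALUES (1.0, 2.0, 3.0)")

# Reading the whole row touches every value exactly once, so the
# sequential record layout described above is not a penalty here:
# there is no skipping over thousands of earlier columns to reach one.
row = con.execute("SELECT * FROM t").fetchone()
print(row)  # (1.0, 2.0, 3.0)
```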

> Your gene expressions are data.  They are not the names of table entities.  
> They should be stored in a table as other posts suggested.

Maybe. But often this data is represented as a row of expressions with
a column for each gene. Because this is what is being distributed, we
are looking for ways to store it in a stable format which will be
supported for the next 50 years, without modifying the original data
too much. I do hear the suggestions to do such a transformation, but
that is less ideal for our use case.


Mitar

-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
