> The table in question:
> CREATE TABLE genotypes (markerid integer NOT NULL REFERENCES marker(id),
>                         individualid integer NOT NULL REFERENCES individuals(id),
>                         genA integer,
>                         genB integer);
>
> I don't see how to segment this. I have 1.000+ individuals and up to
> 500.000 markers. In a full cross of both, I get 500.000.000+ rows. It
> should be possible to use a table per marker or a table per individual
> -- but this would strike me as odd. Even if this table is split up per
> individual (within the same dbfile), say, I fail to identify any gain.

The way to segment data is by usage. At Sprint they grouped cell phone
calling records by billing cycle. Their system generated LOTS of raw data.
Once a billing cycle was completed they had no need to query data from
multiple billing cycles. The previous cycle's data would be summarized and
that summary kept longer term; the detailed records could then be archived
and dropped from active storage.
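That summarize-then-archive pattern can be sketched with Python's stdlib sqlite3 module. The `calls`/`call_summary` schema and the in-memory archive are illustrative assumptions, not anything from the Sprint system; in practice the archive would be a separate database file:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE calls (cycle INTEGER, subscriber INTEGER, minutes INTEGER)")
con.executemany("INSERT INTO calls VALUES (?,?,?)",
                [(1, 100, 12), (1, 100, 7), (1, 200, 3), (2, 100, 5)])

# Cycle 1 is complete: keep only per-subscriber totals in active storage.
con.execute("CREATE TABLE call_summary "
            "(cycle INTEGER, subscriber INTEGER, total_minutes INTEGER)")
con.execute("""INSERT INTO call_summary
               SELECT cycle, subscriber, SUM(minutes)
                 FROM calls WHERE cycle = 1
                GROUP BY cycle, subscriber""")

# Copy the detail rows to an archive database, then drop them from the
# active table. (':memory:' stands in for a real archive file here.)
con.execute("ATTACH DATABASE ':memory:' AS archive")
con.execute("CREATE TABLE archive.calls AS SELECT * FROM calls WHERE cycle = 1")
con.execute("DELETE FROM calls WHERE cycle = 1")

summary = con.execute("SELECT * FROM call_summary ORDER BY subscriber").fetchall()
active = con.execute("SELECT COUNT(*) FROM calls").fetchone()[0]
print(summary, active)  # [(1, 100, 19), (1, 200, 3)] 1
```

The active table stays small (only the open cycle's rows), while the archive can be detached and moved off to cheap storage.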

How do you use your data? Do you really need to compare all the information
about any and all individuals?



> Jay wrote
> > In sqlite you can attach two databases and query across them.
> > Have you considered leaving it as many databases for each genomic
> > region and writing a tool that attaches what it needs to perform its query?
> I'm not sure whether I understand you correctly. You suggest employing
> as many small databases as there are flat files, attaching and
> detaching them as needed?

Group data in whatever is a logical 'lump' for your particular usage.
Say it's by "region": if you need to compare two regions, attach the two
region databases and run your queries across them.
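A minimal sketch of that attach-and-query pattern, again with Python's sqlite3 module. The file names (`region_a.db`, `region_b.db`) and the sample rows are assumptions for illustration; the `genotypes` columns follow the table from the original post:

```python
import sqlite3

# Build two tiny per-region databases (file names are hypothetical).
for name, rows in [("region_a.db", [(1, 10, 0, 1)]),
                   ("region_b.db", [(2, 10, 1, 1)])]:
    con = sqlite3.connect(name)
    con.execute("CREATE TABLE IF NOT EXISTS genotypes "
                "(markerid INTEGER, individualid INTEGER, "
                " genA INTEGER, genB INTEGER)")
    con.execute("DELETE FROM genotypes")
    con.executemany("INSERT INTO genotypes VALUES (?,?,?,?)", rows)
    con.commit()
    con.close()

# Open one region, attach the other, and join across both files
# using schema-qualified table names (main.* vs b.*).
con = sqlite3.connect("region_a.db")
con.execute("ATTACH DATABASE 'region_b.db' AS b")
joined = con.execute("""
    SELECT a.individualid, a.genA, rb.genA
      FROM main.genotypes AS a
      JOIN b.genotypes    AS rb ON a.individualid = rb.individualid
""").fetchall()
print(joined)  # [(10, 0, 1)] -- one row for the shared individual
con.close()
```

Note that SQLite caps the number of simultaneously attached databases (10 by default, configurable at compile time), so the tool would attach only the regions a given query needs.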

You may legitimately need one really large table but most applications don't.
