On 3/16/06, Daniel Franke <[EMAIL PROTECTED]> wrote:
> Each single file contains detailed genotypic information of many
> individuals at a given genomic region. We have to implement _loads_ of
> quality control measures to ensure the maximum possible data
> correctness. Earlier, we did this manually. We can't do this any
> longer. Handling that many files is a nightmare, especially if a couple
> of different individuals are involved. For example, it is horrible to
> extract subsets from that mess (a few markers from a couple of
> individuals is a major problem with many, many files, but easily
> solved in a relational db).
>
In SQLite you can attach two databases and query across them. Have you considered leaving the data as one database per genomic region and writing a tool that attaches only the ones it needs to perform its query? I would bet SQLite could handle the 20 GB file, but it will take a big machine with a lot of RAM and disk space.
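The ATTACH approach above can be sketched in a few lines of Python's built-in sqlite3 module. The file names, table layout, and column names here are made up for illustration; the real per-region databases would of course have their own schema:

```python
import sqlite3
import tempfile
import os

# Hypothetical setup: two small per-region databases, each holding a
# genotypes table (individual, marker, genotype call).
tmp = tempfile.mkdtemp()
region1 = os.path.join(tmp, "region1.db")
region2 = os.path.join(tmp, "region2.db")

for path, rows in [(region1, [("ind1", "rs100", "AA")]),
                   (region2, [("ind1", "rs200", "AG")])]:
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE genotypes (individual TEXT, marker TEXT, call TEXT)")
    con.executemany("INSERT INTO genotypes VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

# Open one region, ATTACH the other under an alias, then query both
# in a single statement -- no need for one giant 20 GB database.
con = sqlite3.connect(region1)
con.execute("ATTACH DATABASE ? AS r2", (region2,))
rows = con.execute("""
    SELECT individual, marker, call FROM main.genotypes
    UNION ALL
    SELECT individual, marker, call FROM r2.genotypes
    ORDER BY marker
""").fetchall()
print(rows)  # genotype calls pulled from both region databases
con.close()
```

A subset-extraction tool would do the same thing: decide which region files a query touches, ATTACH just those, and run one SQL statement across them.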